The family currently contains members of
The document is mainly about proteoQ.
Tandem mass tag (TMT) and label-free quantitation (LFQ) are commonly used methods in mass spectrometry (MS)-based protein and peptide quantification. ProteoQ is a tool specifically developed for analyzing proteomics data. It can utilize results from various search engines, including:
ProteoQ interacts with Excel spreadsheets or .csv files for dynamic sample selection, aesthetic control, and statistical modeling. It also integrates operations such as data normalization against selected rows and/or columns into user-friendly functions at the interface. These arrangements allow users to address biological questions openly using metadata and a range of data pre-processing and informatics tools while keeping ad hoc data manipulation behind the scenes. Additionally, the entire workflow is well-documented and can be easily reproduced.
The informatics framework of proteoQ consists of data processing and informatics analysis. It processes peptide spectrum match (PSM) tables from search engines. The mass analyzers used can be Thermo’s Orbitrap or Bruker’s timsTOF. Peptide and protein results are generated based on user-defined parameters for data filtration, alignment, and normalization. The package also provides a suite of tools and functionalities for statistics, informatics, and data visualization by creating wrappers around published R routines.1
To install this package, start R (latest version) and enter:
In this section of the document, we provide examples of the applications of proteoQ:
Summarization of PSM results into normalized peptide and protein data.
Visualization of quality metrics in normalized peptide and protein data.
Re-normalization of data against selected samples.
Mixed-bed normalization using full or partial data.
Removal of low-quality entries from PSM, peptide, and protein data.
The dataset used in this section corresponds to proteomics data from Mertins et al. (Martins 2018). The study involved two different breast cancer subtypes, triple negative (WHIM2) and luminal (WHIM16), from patient-derived xenograft (PDX) models. The experiments were conducted by three independent laboratories. Each site assessed samples from WHIM2 and WHIM16, which were split and labeled with 10-plex TMT tags in equal sample sizes. The experiments were repeated on different days, resulting in a total of 60 samples labeled under six 10-plex TMT experiments. The samples in each 10-plex TMT experiment were fractionated using off-line, high pH reversed-phase chromatography and analyzed using on-line LC/MS. The MS data were analyzed using search engines such as Mascot, MaxQuant, MSFragger, or Mzion.
For the purpose of this document, a companion package called
proteoQDA was created, which contains a randomly sampled 10% of the PSM
entries from the complete datasets. Please note that the document
focuses on TMT examples, while examples of LFQ analysis can be found in
the help document of proteoQ::load_expts, which is another
resource for exploring proteoQ.
The data packages, proteoQDA, should have been made available through the proteoQ installation.2
To ensure proper protein entry annotation with proteoQ, it is necessary to have the FASTA3 file(s) that were used in the database searches against the WHIM data sets. In the example provided below, the corresponding FASTA files from proteoQDA are copied to a designated database folder:
The data processing starts with the PSM table(s) obtained from search engines. The PSM files are identified based on the following naming conventions:
Mascot: The file names begin with the letter “F”, followed by digits, and end with the extension “.csv”.
MaxQuant: The files start with “msms” and end with the extension “.txt”.
MSFragger: The files start with “psm” and end with the extension “.tsv”.
Mzion: The files start with “psmQ” and end with the extension “.txt”.
You can obtain the corresponding PSM files through one of the following methods. Please choose the appropriate method based on the search engine you used:
For Mascot: Export the PSM table as a CSV file and ensure it follows the naming convention mentioned above.
For MaxQuant: Locate the “msms.txt” file generated by MaxQuant.
For MSFragger: Find the “psm.tsv” file obtained from MSFragger.
For Mzion: Identify the “psmQ.txt” file produced by Mzion.
Once you have the PSM files, you can proceed with the data processing in proteoQ.
Examples of the corresponding PSMs are available through one of the
followings copy_ utilities (suggest go directly with real
PSM outputs from a search engine):
# Mascot
copy_mascot_gtmt()
# or MaxQuant
copy_maxquant_gtmt()
# or MSFragger
copy_msfragger_gtmt()
# or Mzion
copy_proteom_gtmt()To illustrate, I copy over Mascot PSMs to a working directory,
dat_dir:
The workflow includes using an Excel or .csv template that contains the metadata for multiplex experiments. This template includes information such as experiment numbers, TMT channels, LC/MS injection indexes, sample IDs, reference channels, RAW MS data file names, and additional user-defined fields. The default file name for the experimental summary is “expt_smry.xlsx”.
If samples were fractionated offline prior to LC/MS, a second Excel template is used to link multiple RAW MS file names associated with the same sample IDs. The default file name for the fractionation summary is “frac_smry.xlsx”.4
The columns in the “expt_smry.xlsx” file are roughly divided into three tiers:
Essential columns: These columns contain the required information for the TMT experiments.
Optional default columns: These columns provide convenient fields for look-ups in sample selection, grouping, ordering, aesthetics, etc. For example, the program will default to looking for values under the “Color” column when no specific instruction is given for color coding a PCA plot.
Optional open fields: These fields allow users to define their
own analysis and aesthetics. For example, multiple columns of contrasts
can be defined at different levels of granularity for use in statistical
modeling. More details and information can be found in the documentation
by using the command ?proteoQ::load_expts().
By default, throughout the document, we will assume the use of the “expt_smry.xlsx” and “frac_smry.xlsx” file names.
We next copy over a pre-compiled expt_smry.xlsx and a frac_smry.xlsx to the working directory:
We now have all the pieces that are required by proteoQ in place.
Let’s have a quick glance at the expt_smry.xlsx file. We note that no
reference channels were indicated under the column
Reference. With proteoQ, the log2FC of each species in a
given sample is calculated either (a) in relative to the
reference(s) within each multiplex TMT experiment or (b) to the
mean of all samples in the same plex if reference(s) are absent. Hence,
the later approach will be employed to the exemplary data set that we
are working with. In this special case, the mean(log2FC)
for a given species in each TMT experiment is averaged from five WHIM2
and five WHIM16 aliquots, which are biologically equivalent across TMT
experiments.
PSMs are MS/MS events that lead to peptide identification at certain confidence levels. The evidences in PSMs can then be summarized to peptide and protein findings using various descriptive statistics. In this section, we will apply proteoQ to summarize PSM data into peptide and protein reports.
We start the section by processing the PSM files from Mascot searches:
# columns keys in PSM files suitable for varargs of `filter_`
#
# (need utility `join_mgfs` with MSGF+ outputs)
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
rptr_intco = 1000,
rm_craps = TRUE,
rm_krts = FALSE,
rm_outliers = FALSE,
annot_kinases = TRUE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
filter_psms = rlang::exprs(pep_expect <= .1, pep_score >= 15),
filter_more_psms = rlang::exprs(pep_rank == 1),
)At group_psm_by = pep_seq, PSM entries with the same
primary peptide sequence but different variable modifications will be
grouped for analysis using descriptive statistics. Alternatively, if
group_psm_by = pep_seq_mod, PSMs will be grouped according
to the unique combination of the primary sequences and variable
modifications of peptides. Similarly, the group_pep_by
parameter specifies whether peptides should be grouped by protein
accession names or gene names. The fasta argument should
point to the location of a copy of the FASTA files used in the
corresponding MS/MS searches. For more information on normPSM, you can
access its help document via ?normPSM.
Each time the normPSMmodule is executed, it processes
the PSM data from scratch without any memory of previous operations. For
example, if we visually inspect the intensity distributions of reporter
ions using plot_rptr_int = TRUE, we may decide to
reconsider a more inclusive cutoff value, such as
rptr_intco = 100, instead of the default value of 1,000.
Decreasing rptr_intco from 1,000 to 100 does not result in
any loss of information. It is important to note that the change of
reporter-ion intensities from 100 to 1,000 does not cause data loss.
This is a trivial fact, but it is worth mentioning here. As we will
discover in the following sections, the peptide and protein
normalization utilities, standPep and
standPrn, do retain information from previous iterations
and can lead to interesting features, such as mixed-bed normalization of
data.
If experiments have similar quantities of input materials, there may
still be unforeseen events that can result in a decrease in the ranges
of reporter-ion intensity for certain samples. In such cases, it may be
justifiable to exclude outlier samples from further analysis. To remove
these outlier samples and reprocess the PSM data, you can simply delete
the corresponding entries under the “Sample_ID” column in the
expt_smry.xlsx file. Afterward, you can re-execute the
normPSM() function. For more information on data
exclusion and metadata
for LFQ, please refer to the additional notes provided.
There is a subtle problem when we choose to remove PSM outliers at
rm_outliers = TRUE. Note that PSM outliers will be assessed
at a per-peptide-and-per-sample basis, which could be a slow process for
large data sets. To circumvent repeated efforts in finding PSM outliers,
we may initially set rm_outliers = FALSE and
plot_rptr_int = TRUE when executing normPSM().
This will allow us to first decide on an ultimate threshold of
reporter-ion intensity, before proceeding to the more time-consuming
procedure in PSM outlier removals.
The normPSM function provides the flexibility to apply
user-defined arguments for row filtration using logical conditions. In
the example you provided, the filter_psms_atvariable
argument was used to limit the PSM entries based on conditions such as
pep_expect <= 0.1 and
pep_score >= 15.
You can choose to put the conditions in a single statement or in
multiple statements, as it makes no difference in the outcome. The
format for creating and assigning varargs is
filter_blahblah = exprs(cdn1, cdn2, ..., cdn_last), where
filter_blahblah indicates the task of data filtration, and
cdn1, cdn2, ..., cdn_last represent the logical conditions
based on the column keys in your PSM data.
If your column keys contain whitespace or special characters, you need to enclose them in backticks (`) to ensure proper referencing. This approach can be applied to other PSM data sources like MaxQuant, MSFragger, and Mzion, using the relevant column keys in your logical conditions.
Using the rlang package, you can supply the logical conditions within
round parentheses after exprs. The proteoQ program will
obtain the expressions on the right-hand side (rhs) of each vararg
statement and perform the necessary data filtrations using
rlang::eval_bare and rlang::eval_tidy.
While the approach of data filtration might seem unusual at first, it
allows for integrated data processing and reduces the need for direct
data manipulations. It enhances reproducibility and enables easy recall
of previous actions. Additionally, the built-in data filtration approach
can serve as a foundation for more complex data processing tasks, such
as mixed-bed normalization by sample groups against full or partial
data. You can refer to the help documents (?standPep and
?standPrn) for more information on these advanced data
processing capabilities in proteoQ.
With normPSM, we have the ability to filter data under any PSM columns as desired. In the example mentioned above, I have chosen to filter PSM entries based on pep_expect and pep_score. There is a rationale behind this selection.
Now, let’s consider a different column, pep_len. The values in this column are unique to both PSMs and peptides. However, it is not yet necessary to filter PSM data based on peptide length. We can delay the filtration of peptide entries by their sequence lengths until we are specifically working with peptide data. Summarizing PSMs to peptides does not change the number of amino acid residues in the peptides. On the other hand, the data under pep_expect is unique to PSMs, but not necessarily to peptides. It is evident that each PSM event for the same peptide is likely to have its individual confidence expectation in peptide identification. Therefore, if we were to filter data based on pep_expect values at a later stage of analysis, we would lose the authentic information regarding pep_expect for peptides with multiple PSM identifications. More specifically, the values under pep_expect in peptide tables represent the geometric-mean representation of PSM results (see also section 4).
To address this, I have named the vararg “filter_psms_at” in the above normPSM examples. This allows me to easily understand that I am filtering data based on criteria specific to PSMs.
Vararg statements of filter_ and arrange_
are available in proteoQ for flexible filtration and ordering of data
rows. In addition, there are filter2_ and
arrange2_. As indicated by their names,
filter_ and filter2_ perform row filtration
against column keys from a primary data file, df, and
secondary data file(s), df2, respectively (df
and df2 defined here).
The same correspondence is applicable for arrange_ and
arrange2_ varargs.
Users will typically employ either primary or secondary vararg
statements, but not both. In the more extreme case of
gspaMap(...), it links prnGSPA(...) findings
in df2 to the significance p-values and abundance fold
changes in df for volcano plot visualization by gene
sets.
To conclude our discussion on PSM processing, let’s explore another
tool called purgePSM. This utility is designed for data
cleanup and removes quantitations based on the coefficient of variation
(CV) of peptides, calculated from contributing PSMs within the same
sample. Specifically, quantitations that result in a peptide CV greater
than a specified cut-off value will be replaced with NA.
The purgePSM utility reads files such as
/PSM/TMTset1_LCMSinj1_PSM_N.txt, which are generated in the preceding
step of normPSM, and updates these files accordingly. If
there is a need to undo the nullification of data performed by
purgePSM, a complete restart of the normPSM
process is required. Alternatively, a temporary copy of these files can
be made for potential undo purposes.
This process is carried out column-wise, on a sample level, while
preserving the positions for data points that have been nullified. It
differs from the row filtration processes performed by the
filter_ functions, as there are no rows being removed
during the purging process.
Earlier in section 1.2.1, we have set
plot_log2FC_cv = TRUE by default when calling
normPSM. This will plot the distributions of the CV of
peptide log2FC. In the event of plot_log2FC_cv = FALSE, we
can have a second chance in visualizing the distributions of peptide CV
before any permanent data nullification:
Taking the sample entries under TMT_Set one and
LCMS_Injection one in label_scheme.xlsx as an example, we
can see that a small portion of peptides have CV greater than 0.5 at
log2 scale (Figure 1A).
Figure 1A-1C. CV of peptide log2FC (based on full data set). Left: no CV cut-off; middle: CV cut-off at 0.5; right: CV cut-off at 95 percentile.
Quantitative differences greater than 0.5 at a log2 scale is relatively large in TMT experiments 5, which can be in part ascribed to a phenomenon called peptide co-isolation and co-fragmentation in reporter ion-based MS experiments. We might, for instance, perform an additional cleanup by removing column-wisely data points with CV greater than 0.5 (Figure 1B):
The above method using a flat cut-off would probably fall short if the ranges of CV are considerably different across samples (see Lab 3.1). Alternatively, we can remove low-quality data points using a CV percentile, let’s say at 95%, for each sample (Figure 1C):
# copy back `\PSM\TMTset1_LCMSinj1_PSM_N.txt` etc. before proceed
# otherwise the net effect will be additive to the prior(s)
purgePSM (
pt_cv = 0.95,
)In the event of both pt_cv and max_cv being
applied to nullify data, they follow the precedence of
pt_cv > max_cv. When needed, we can overrule the default
by executing purgePSM sequentially at a custom order:
# at first no worse than 0.5
purgePSM (
max_cv = 0.5,
)
# next `pt_cv` on top of `max_cv`
purgePSM (
pt_cv = 0.95,
)The data purge is also additive w.r.t. to repetitive analysis. In the following example, we are actually perform data cleanup at a CV threshold of 90%:
While multiple PSMs carry information about the precision in peptide
measures, the above single-sample variance does not inform sampling
errors prior to peptide separations. For instance, the same peptide
species from a given sample remain indistinguishable/exchangeable prior
to the off-line fractionation. As a result, the CV shown by
normPSM or purgePSM mainly tell us the
uncertainty of measures beyond the point of peptide parting.
NB: CV is sensitive to outliers and some large CV in peptide
quantitations may be merely due to a small number of aberrant measures
(see George Casella and Roger L Berger 2002
ch. 10 for quantitative descriptions). Although the option of
rm_outliers was set to FALSE during our
earlier call to normPSM, I think it is generally a good
idea to have rm_outliers = TRUE.
To perform the summary of PSM results to peptides, we will utilize
the PSM2Pep utility. This utility will load the PSM tables
generated from the previous normPSM procedure, such as
PSM/TMTset1_LCMSinj1_PSM_N.txt, and summarize the data to peptide level
using various descriptive statistics (refer to Section 4 for more
details).
To invoke the PSM2Pep utility, you can use the following
code:
By default, PSM2Pep uses the median as the summarization
method for intensity and log2FC data. However, you can specify a
different summarization method for intensity and log2FC data by using
the “method_psm_pep” argument. PSM2Pep also provides the
option for data filtration using varargs. For example, you can filter
the data based on the MS1 intensities of peptide matches, which are
indicated under the “pep_tot_int” column. However, please note that the
example codes provided below should not be used with the current Mascot
example we are working on:
In order to make it functionally valid, we would need to include the Raw peptide match data when exporting Mascot PSMs.
After summarizing PSMs to peptides, the next step involves using the utility mergePep to merge individual peptide tables (such as Peptide/TMTset1_LCMSinj1_Peptide_N.txt, TMTset1_LCMSinj2_Peptide_N.txt, etc.) into one larger table called Peptide.txt. Here’s an example code for using mergePep:
Similar to normPSM and PSM2Pep, you can
apply data filtration using the column keys linked to the varargs of
filter_. In the example provided, the vararg statement
filter_peps_by is used to exclude peptide sequences with
more than 50 amino acid residues. You can customize the filters based on
your specific requirements. For instance, if you are interested in human
peptides from the PDX samples but not mouse peptides, you can specify
species == "human" as a filter criterion. Sometimes, during
the early stages of analysis, it may be unclear how to properly filter
the data. In such cases, additional quality assessments can be
performed, which we will explore shortly. Alternatively, you may choose
to keep as much information as possible and apply varargs in downstream
analysis.
The utility standPep is used to standardize peptide
results obtained from the mergePep step. It provides
additional choices for data alignment. Here’s an example code for using
standPep:
standPep(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
)The parameters range_log2r and range_int
define the ranges of peptide log2FC and reporter-ion intensity,
respectively. These ranges are used to determine the CV and scale the
log2FC values across samples. By default, the log2FC of the peptide data
will be aligned using median centering. However, if you choose
method_align = MGKernel, the log2FC values will be aligned
assuming multiple Gaussian kernels. The n_comp parameter
specifies the number of Gaussian kernels to use, and the
seed parameter sets a seed for reproducible fittings.
Additional parameters like maxit and epsilon
are defined in normalmixEM
and are used for fitting purposes.
It’s worth noting that density kernel estimates can occasionally capture spikes in the profiles of log2FC during data alignment. It is recommended that researchers inspect the alignment of ratio histograms and optimize data normalization by experimenting with different combinations of tuning parameters or by focusing on a subset of samples before proceeding to the next steps.
Furthermore, standPep can be applied to specific sample
columns and data rows, allowing for interactive and cumulative effects.
Combinations and iterations of features can lead to specialized sample
alignments, which will be discussed in more detail in sections 1.3.5 to
1.3.7. Before diving deeper into these details, it would be helpful to
utilize the pepHist utility discussed in the upcoming
section.
The pepHist utility plots the histograms of peptide
log2FC. It further bins the data by their contributing reporter-ion or
LFQ intensity. In the examples shown below, we compare the
log2FC profiles of peptides with and without scaling
normalization:6
# without scaling
pepHist(
scale_log2r = FALSE,
ncol = 10,
)
# with scaling
pepHist(
scale_log2r = TRUE,
ncol = 10,
)By default, the above calls topepHist will look for none
void entries under column Select in expt_smry.xlsx. This
will results in histogram plots with 60 panels in each, which may not be
easy to explore as a whole. In stead, we will break the plots down by
their data origins. We begin with modifying the expt_smry.xlsx file by
adding the columns BI_1, JHU_1 etc. Each of
the new columns includes sample entries that are tied to their
laboratory origins and TMT batches (the columns are actually already in
the expt_smry.xlsx).
We now are ready to plot histograms for each subset of the data. In
this document, we only display the plots using the BI_1
subset:
# without scaling
pepHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)NB: We interactively told pepHist() that we are
interested in sample entries under the newly created BI_1
column. Behind the scene, the interactions occurred via the reading of
the Setup workbook in expt_smry.xlsx. We also supply a file
name, assuming that we want to keep the previously generated plots with
default file names of Peptide_Histogram_N.png and
Peptide_Histogram_Z.png.
Figure 2A-2B. Histograms of peptide log2FC. Top:
scale_log2r = FALSE; bottom,
scale_log2r = TRUE
As expected, both the widths and the heights of log2FC profiles
become more comparable after the scaling normalization. However, such
adjustment may cause artifacts when the standard deviation across
samples are genuinely different. I typically test
scale_log2r at both TRUE and
FALSE, then make a choice in data scaling together with my
a priori knowledge of the characteristics of both samples and
references.7 We will use the same data set to illustrate
the impacts of reference selections in scaling normalization in Lab 3.1.
It should also be noted that the curves of Gaussian density in
histograms are calculated during the latest call to
standPep(...) with the option of
method_align = MGKernel. There is a useful side effect when
comparing leading and lagging profiles of log2FC. In the following
bare-bones example, we align differently the peptide log2FC with the
default method of median centering:
We then visualize the histograms of the ratio profiles (Figure 2C):
Within this document, the preceding example that involves
standPep(...) at method_align = MGKernel is
given in section 1.3.3. In this case, a comparison between the present
and the prior histograms will reveal the difference in ratio alignments
between a median centering and a three-Gaussian assumption. More
examples in the side effects can be found from the help document via
?standPep and ?pepHist.
Figure 2C-2D. Histograms of peptide log2FC. Top:
median-centering for all samples; bottom: W2.BI.TR2.TMT1
aligned differently by Gaussian density
The varargs of filter_ are also available in the
pepHist utility. With the following examples, we can
visualize the peptide log2FC with human and mouse
origins, respectively:
Now that we have been acquainted with pepHist, let’s
revisit and explore additionally standPep with its features
in normalizing data against defined sample columns (and data rows in the
following sections).
Needs in data (re)normalization may be encountered more often than
not. One of the trivial circumstances is that a multi-Gaussian kernel
can fail capturing the log2FC profiles for a subset of samples. This is
less an issue with a small number of samples. Using a trial-and-error
approach, we can start over with a new combination of parameters, such
as a different seed, and/or a different range of
range_log2r etc. However, the one-size-fit-all attempt may
remain inadequate when the number of samples is relatively large. With
proteoQ, we can chose to focus fit against selected samples.
This is again the job of argument col_select. Let’s say we
want to re-fit the log2FC for samples W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2.
We simply add a column, which I named it Select_sub, to
expt_smry.xlsx with the sample entries for re-fit being indicated under
the column:
We may then execute the following codes with argument
col_select being linked to the newly created column:
standPep(
method_align = MGKernel,
range_log2r = c(10, 90),
range_int = c(5, 95),
n_comp = 3,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
col_select = Select_sub,
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = mixed_bed_3.png,
)In the preceding execution of bare-bones standPep(),
samples were aligned by median centering (Figure 2C).
As expected, the current partial re-normalization only affects samples
W2.BI.TR2.TMT1 and W2.BI.TR2.TMT2 (Figure 2D,
W2.BI.TR2.TMT2 not shown). In other words, samples W2.BI.TR2.TMT1 and
W2.BI.TR2.TMT2 are now aligned by their Gaussian densities whereas the
remaining are by median centering. The combination allows us to align
sample by mixed-bedding the MC or the MGKernel
method.
We have previously applied the varargs of filter_ in
normPSM and mergePep to subset data rows. With
this type of arguments, data entries that have failed the filtration
criteria will be removed for indicated analysis.
Similarly, we employed the filter_ varargs in
pepHist to subset peptides with human or mouse origins
(section 1.3.4.3). This is often not an issue in informatic analysis and
visualization, as we do not typically overwrite the altered inputs on
external devices at the end. Sometimes we may however need to carry out
similar tasks based on partial inputs and update the complete set of
data for future uses. One of the circumstances is model parameterization
by a data subset and to apply the finding(s) to update the complete
set.
The standPep utility accepts variable arguments of
slice_. The vararg statement(s) identify a subset of data
rows from the Peptide.txt. The partial data will be taken for
parameterizing the alignment of log2FC across samples. In the
hypothetical example shown below, we normalize peptide data based
peptide entries with sequence lengths greater than 10 and smaller than
30. The full data set will be updated accordingly with the newly derived
parameters. Different to the filter_ varargs, there is
no data entry removals from the complete data set with the
slice_ procedure.
## DO NOT RUN
standPep(
...,
slice_peps_by = rlang::exprs(pep_len > 10, pep_len < 30),
)
## END of DO NOT RUNThe varargs are termed slice_ to make distinction to
filter_. Although it might at first seem a little involved,
the underlying mechanism is simple: col_select defines the
sample columns and slice_ defines the data
rows in Peptide.txt; and only the intersecting
area between columns and rows will be subject additively to the
parameterization in data alignment. The same pattern will be applied
every time we invoke standPep.
Just like col_select and filter_ in
pepHist, the combination in fixed argument
col_select and variable argument
slice_ can lead to features in versatile data processing.
Several working examples are detailed and can be accessed via
?standPep and ?standPrn.
Now it becomes elementary if we were to normalize data against a set
of housekeeping protein(s). Let’s say we have GAPDH in mind
as an accurate housekeeping invariant among the proteomes, and of course
we are confident about the good accuracy in their log2FC. We simply
slice the peptide entries under GAPDH out for
use as a normalizer:
standPep(
method_align = MC,
range_log2r = c(10, 90),
range_int = c(5, 95),
col_select = Select_sub,
slice_hskp = rlang::exprs(gene %in% c("GAPDH")),
)
pepHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = housekeepers.png,
)Note that I chose method_align = MC in the above. There
are only a few rows available for the samples linked to
col_select, after slicing out GAPDH! The number of data
points is too scare for fitting the selected samples against a
3-component Gaussian. A more detailed working example can also be found
via ?standPep where you would probably agree that GAPDH is
actually not a good normalizer for the data set.
Analogously to the PSM processing, we may nullify data points of peptides by specifying a cut-off in their protein CVs:
# no purging
purgePep()
# or purge column-wisely by max CV
purgePep (
max_cv = 0.5,
filename = "by_maxcv.png",
)
# or purge column-wisely by CV percentile
# remember the additive effects
purgePep (
pt_cv = 0.5,
filename = "by_ptcv.png",
)NB: The above single-sample CVs of proteins are based on
ascribing peptides, which thus do not inform the uncertainty in sample
handling prior to the parting of protein entities, for example, the
enzymatic breakdown of proteins in a typical MS-based proteomic
workflow. On the other hand, the peptide log2FC have been previously
summarized by the median statistics from contributing PSMs. Putting
these two together, the CV by purgePep describes
approximately the uncertainty in sample handling from the breakdown of
proteins to the off-line fractionation of peptides. (see also George Casella and Roger L Berger 2002
ch. 7.3.3 for the technical details of conditional variance.)
In this section, we summarize peptides to proteins, for example, using a two-component Gaussian kernel and customized filters.
The utility for the summary of peptides to proteins is
Pep2Prn:
It loads the Peptide.txt and summarize the peptide data to interim
protein results in Protein.txt, using various descriptive statistics
(see also Section 4). For intensity and log2FC data, the summarization
method is specified by argument method_pep_prn, with
median being the default.
The utitily also accept varargs of filter_ for data row
filtration against the column keys in Peptide.txt.
The utility standPrn standardizes protein results from
Pep2Prn with additional choices in data alignment.
standPrn(
range_log2r = c(10, 90),
range_int = c(5, 95),
method_align = MGKernel,
n_comp = 2,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
slice_prots_by = rlang::exprs(prot_n_pep >= 2),
)It loads Protein.txt from Pep2Prn or a preceding
standPrn procedure and align protein data at users’
choices. The utility is analogous to standPep with choices
in col_select and slice_. In the above
example, the normalization is against proteins with two more identifying
peptides. For helps, try ?standPrn.
Similar to the peptide summary, we can inspect the alignment and the scale of ratio profiles for protein data:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filename = bi1_n.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filename = bi1_z.png,
)For simplicity, we only display the histograms with scaling normalization (Figure 2E).
Figure 2E-2F. Histograms of protein log2FC at
scale_log2r = TRUE. Left: before filtration; right, after
filtration
In section 1.3.4.2, we used pepHist to illustrate the
side effects in histogram visualization when toggling the alignment
methods between MC and MGKernel. In the
following, we will show another example of side effects using the
protein data.
We prepare the ratio histograms for proteins with ten or more quantifying peptides:
# without scaling
prnHist(
scale_log2r = FALSE,
col_select = BI_1,
ncol = 5,
filter_prots_by = rlang::exprs(prot_n_pep >= 10),
filename = bi1_n_npep10.png,
)
# with scaling
prnHist(
scale_log2r = TRUE,
col_select = BI_1,
ncol = 5,
filter_prots_by = rlang::exprs(prot_n_pep >= 10),
filename = bi1_z_npep10.png,
)The density curves are based on the latest call to
standPrn(...) with method_align = MGKernel
(Figure 2E). For simplicity, we again only show the
current plots at scale_log2_r = TRUE (Figure
2F). The comparison between the lead and the lag allows us to
visualize the heteroscedasticity in data and in turn inform new
parameters in data renormalization.
Up to this point, we might have reach a consensus on the choice of
scaling normalization. If so, it may be plausible to set the value of
scale_log2r under the Global environment, which is
typically the R console that we are interacting with.
In this way, we can skip the repetitive setting of
scale_log2r in our workflow from this point on, and more
importantly, prevent ourselves from peppering the values of
TRUE or FALSE in scale_log2r from
analysis to analysis.
Exemplary scripts are summarized in:
system.file("extdata", "workflow_tmt_base.R", package = "proteoQ")
system.file("extdata", "workflow_tmt_ext.R", package = "proteoQ")Another place to get help is via ?load_expts.
Additional notes available in here.
For the purpose of quick demonstrations where steps in data preprocessing were bypassed:
unzip(system.file("extdata", "demo.zip", package = "proteoQDA"),
exdir = "~/proteoq_bypass", overwrite = FALSE)
# file.exists("~/proteoq_bypass/proteoQ/examples/Peptide/Peptide.txt")
# file.exists("~/proteoq_bypass/proteoQ/examples/Protein/Protein.txt")
library(proteoQ)
load_expts("~/proteoq_bypass/proteoQ/examples")
# Exemplary protein MDS
prnMDS(
show_ids = FALSE,
width = 8,
height = 4,
)In this section I illustrate the following applications of proteoQ:
Unless otherwise mentioned, the in-function filtration
of data by varargs of filter_ is available throughout this
section of informatic analysis. Row ordering of data, indicated by
arrange_, is available for heat map applications using
pepHM, prnHM and
plot_metaNMF.
We first visualize MDS and Euclidean distance against the peptide
data. We start with metric MDS for peptide data (prnMDS for
proteins):
Figure 3A. MDS of peptide log2FC at
scale_log2r = TRUE
It is clear that the WHIM2 and WHIM16 samples are well separated by
the Euclidean distance of log2FC (Figure 3A). We next
take the JHU data subset as an example to explore batch
effects in the proteomic sample handling:
# `JHU` subset
pepMDS(
col_select = JHU,
filename = jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)
Figure 3B-3C. MDS of peptide log2FC for the
JHU subset. Left: original aesthetics; right, modefied
aesthetics
We immediately spot that all samples are coded with the same color
(Figure 3B). This is not a surprise as the values under
column expt_smry.xlsx::Color are exclusively
JHU for the JHU subset. For similar reasons,
the two different batches of TMT1 and TMT2 are
distinguished by transparency, which is governed by the column “Alpha”
in expt_smry.xlsx. We may wish to modify the aesthetics using different
keys: e.g., color coding by WHIMs and size coding by batches, without
the recourse of writing new R scripts. One solution is to link the
attributes and sample IDs by creating additional columns in
expt_smry.xlsx. In this example, we have had coincidentally prepared the
column “Shape” and “Alpha” to code WHIMs and batches, respectively, for
the JHU subset. Therefore, we can recycle them directly to
make a new plot (Figure 3C):
# new `JHU` subset
pepMDS(
col_select = JHU,
col_fill = Shape, # WHIMs
col_size = Alpha, # batches
filename = new_jhu.png,
show_ids = FALSE,
height = 3,
width = 8,
)While MDS approximates Euclidean and other distance
measures at a low-dimensional space. Sometimes it may be useful to have
an accurate view of the distance matrix. Functions
pepEucDist and prnEucDist plot the heat maps
of Euclidean distance matrix for peptides and proteins, respectively.
Supposed that we are interested in visualizing the distance matrix for
the JHU subset:
# `JHU` subset
pepEucDist(
col_select = JHU,
annot_cols = c("Shape", "Alpha"),
annot_colnames = c("WHIM", "Batch"),
# `pheatmap` parameters
display_numbers = TRUE,
number_color = "grey30",
number_format = "%.1f",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
fontsize = 16,
fontsize_row = 20,
fontsize_col = 20,
fontsize_number = 8,
cluster_rows = TRUE,
show_rownames = TRUE,
show_colnames = TRUE,
border_color = "grey60",
cellwidth = 24,
cellheight = 24,
width = 14,
height = 12,
filename = jhu.png,
)The graphic controls of heat maps are achieved through pheatmap
with modifications. Parameter annot_cols defines the tracks
to be displayed on the top of distance-matrix plots. In this example, we
have chosen columns “Shape” and “Alpha” in expt_smry.xlsx to encodes the
WHIM subtypes and the batch numbers, respectively. Parameter
annot_colnames allows us to rename the tracks from
Shape and Alpha to WHIM and
Batch, respectively, for better intuition. We can
alternatively add columns WHIM and Batch if we
choose not to recycle and rename columns Shape and
Alpha.
Figure 3D. EucDist of peptide log2FC at
scale_log2r = TRUE
The utility is currently applied to Euclidean distances with an
argument adjEucDist for a probable compensation of
distances between TMT experiments. As mentioned earlier, the
quantitative log2FC are measured in relative to the reference materials
under each multiplex TMT experiments. When concatenating data across TMT
experiments, the measurement errors may accumulate differently. Likely
the uncertainty in the reference signals will be greater if we were to
prepare the references at an earlier stage of sample handling as opposed
to a later stage. I tried to go through the most fundamental
calculations step-by-step to help myself understand the differences:
Figure 3E. Accumulation of Euclidean distance in the
interplex comparison of log2FC
The adjustment might be more suitable for studies where both the
samples and references are largely similar in proteome compositions. The
setting of adjEucDist = TRUE would discount the distances
between references when using visualization techniques such a MDS or
distance heat maps. In the cases that sample differences are exceedingly
greater than handling errors, the setting of
adjEucDist = FALSE would probably be more appropriate.
The utilities for PCA analysis are pepPCA and
prnPCA for peptide and protein data, respectively. They are
wrappers of the stats::prcomp. Data scaling and centering
are the two aspects that have been emphasized greatly in PCA analysis.
Some notes on proteoQ data scaling are available in section 3.1.1; hence
in the present section, we will focus only on trials against data being
scaled.
With proteoQ, the option in data scaling is set by variable
scale_log2r, which will be passed to the
scale. in stats::prcomp. For data centering,
proteoQ relays the TRUE default to
stats::prcomp.
Provided the importance of data centering in PCA and several other
analyses, proteoQ further incorporated the three columns of
prot_mean_raw, prot_mean_n and
prot_mean_z in protein outputs. The first one summarizes
the mean log2FC before data alignment for individual
proteins across selected samples. The second and the third compute the
corresponding mean log2FC after data alignment, with and
without scaling normalization, respectively (see also section 4 for
column keys). The corresponding columns summarizing the mean deviation
in peptide data are pep_mean_raw, pep_mean_n
and pep_mean_z. As usual, the sample selections can be
customized through the argument col_select.
The mean log2FC of proteins or peptides may serve as
indicators that how far a given protein or peptide species is away from
the data centering format (a.k.a. mean deviation form) that will be
enforced by default in PCA. Taking protein data as an example, we will
go through couple settings in prnPCA. At first, we
performed PCA with data centering by default (Figure
4A):
We next performed another PCA with the removals of proteins that are far from mean deviation form (Figure 5B):
# observe that the overall deviations from "mean zero" may not be symmetric
prnPCA(
col_select = Select,
show_ids = FALSE,
filter_prots_by = rlang::exprs(prot_mean_z >= -.25, prot_mean_z <= .3),
filename = sub_cent.png,
)Note that the clusterings are tightened under each sample type of
WHIM2 or WHIM16 after the filter_prots_by filtration.
Further note that the proportion of variance explained in the
first principal axis decreased from 57.5% to 55.4% after the data
filtration. This suggests that the entries deviating the most from
mean zero are more leveraging towards the explained
variance, even with data centering. In other words, high deviating
entries are in general associated with above-average data variance, in
relative to the entire data set. The observation also indicates that a
high value of proportion of variance explained may not
necessary be a go-to standard for differentiating sample types in that
variance may be sensitive to leveraging data points.
Figure 4A-4B. PCA of protein log2FC with data centering
on. Left: without filtration; right, with filtration
We next explore the analogous, but by turning off data centering:
prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filename = nocent.png,
)
prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filter_prots_by = rlang::exprs(prot_mean_z >= -.25, prot_mean_z <= .3),
filename = sub_nocent.png,
)First note that there is no labels of the proportion of variance explained since such a view of variance is often not suitable without data centering. Instead, an interpretation as square Euclidean distance would be more appropriate.
Further note the wider spread in PC1 and narrower in PC2 for the analysis without the removal of high deviation entries (Figure 4C versus 4D). The driving force for the difference may be again ascribed to the more leveraging data entries. Intuitively speaking, the high leverage points tend to associate with higher-than-normal Euclidean distance. This becomes more evident after the removals of the high deviation entries (Figure 4D).
The above showcases that the choice in data centering can lead to different interpretation in biology, which may be in part ascribed to high deviation entries. The phenomena can, however, be conveniently explored via proteoQ.
Figure 4C-4D. PCA of protein log2FC. Left: data
centering off without filtration; right, data centering
off with filtration
The y-labels in Figure 4C are not well separated.
This can be fixed by providing a custom theme to prnPCA
(see also the help document via ?prnPCA). Alternatively, we
may export the PCA results for direct ggplot2:
res <- prnPCA(
col_select = Select,
center = FALSE,
show_ids = FALSE,
filename = foo.png,
)
# names(res)
library(ggplot2)
my_theme <- theme_bw() + theme(
axis.text.x = element_text(angle=0, vjust=0.5, size=20),
axis.text.y = element_text(angle=0, vjust=0.5, size=20),
axis.title.x = element_text(colour="black", size=20),
axis.title.y = element_text(colour="black", size=20),
plot.title = element_text(face="bold", colour="black", size=20, hjust=0.5, vjust=0.5),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.key = element_rect(colour = NA, fill = 'transparent'),
legend.background = element_rect(colour = NA, fill = "transparent"),
legend.title = element_blank(),
legend.text = element_text(colour="black", size=14),
legend.text.align = 0,
legend.box = NULL
)
p <- ggplot(res$pca) +
geom_point(aes(x = PC1, y = PC2, colour = Color, shape = Shape,
alpha = Alpha), size = 4, stroke = 0.02) +
scale_y_continuous(breaks = seq(5, 15, by = 5)) +
labs(title = "", x = paste0("PC1 (", res$prop_var[1], ")"), y = paste0("PC2 (", res$prop_var[2], ")")) +
coord_fixed() +
my_theme
ggsave(file.path(dat_dir, "Protein/PCA/nocent_2.png"), width = 6, height = 4)Figure 4E. Custom plot.
The PCA findings at higher dimensions may be visualized via pairwise plots between principal components.
Figure 4F. Higher dimensions.
Additional examples and analogous high-dimension MDS can be found
from the help documents via ?prnPCA and
?prnMDS, respectively.
See notes here.
In this section, we visualize the batch effects and biological
differences through correlation plots. The proteoQ tool currently limits
itself to a maximum of 44 samples for a correlation plot. In the
document, we will perform correlation analysis against the
PNNL data subset. By default, samples will be arranged by
the alphabetical order for entries under the column “Select” in
expt_smry.xlsx. We have learned from the earlier MDS analysis that the
batch effects are smaller than the differences between WHIM2 and WHIM16.
We may wish to put the TMT1 and TMT2 groups
adjacent to each other for visualization of more nuance batch effects,
followed by the comparison of WHIM subtypes. We can achieve this by
supervising sample IDs at a customized order. In the expt_smry.xlsx, We
have prepared an “Order” column where samples within the JHU subset were
arranged in the descending order of W2.TMT1, W2.TMT2, W16.TMT1 and
W16.TMT2. Now we tell the program to look for the “Order” column for
sample arrangement:
# peptide logFC
pepCorr_logFC(
col_select = PNNL,
col_order = Order,
filename = pep_pnnl.png,
)
# protein logFC
prnCorr_logFC(
col_select = PNNL,
col_order = Group,
filename = prn_pnnl.png,
)
Figure 5A-5B. Correlation of log2FC for the
PNNL subset. Left: peptide; right, protein
To visualize the correlation of intensity data, we can use
pepCorr_logInt and prnCorr_logInt for peptide
and protein data, respectively. More details can be assessed via
?pepCorr_logFC.
Heat map visualization is commonly applied in informatic analysis.
The corresponding facilities in proteoQ are pepHM and
prnHM for peptide and protein data, respectively. They are
wrappers of pheatmap
with modifications and exception handling. More details can be found by
accessing the help document via ?prnHM.
The following shows an example of protein heat map:
prnHM(
xmin = -1,
xmax = 1,
xmargin = 0.1,
annot_cols = c("Group", "Color", "Alpha", "Shape"),
annot_colnames = c("Group", "Lab", "Batch", "WHIM"),
cluster_rows = TRUE,
cutree_rows = 10,
show_rownames = FALSE,
show_colnames = TRUE,
fontsize_row = 3,
cellwidth = 14,
filter_sp = rlang::exprs(species == "human"),
)we chose to top annotate the heat map with the metadata that can be
found under the columns of “Group”, “Color”, “Alpha” and “Shape” in
expt_smry.xlsx. For better convention, we rename them to “Group”, “Lab”,
“Batch” and “WHIM” to reflect their sample characteristics. We further
supplied a vararg of filter_sp where we assume exclusive
interests in human proteins.
Figure 6A. Heat map visualization of protein log2FC
Row ordering of data is also implemented in the heat map utility.
prnHM(
xmin = -1,
xmax = 1,
xmargin = 0.1,
annot_cols = c("Group", "Color", "Alpha", "Shape"),
annot_colnames = c("Group", "Lab", "Batch", "WHIM"),
cluster_rows = FALSE,
annot_rows = c("kin_class"),
show_rownames = TRUE,
show_colnames = TRUE,
fontsize_row = 2,
cellheight = 2,
cellwidth = 14,
filter_kin = rlang::exprs(kin_attr == TRUE, species == "human"),
arrange_kin = rlang::exprs(kin_order, gene),
filename = "hukin_by_class.png",
)In the above example, we applied vararg filter_kin to
subset human kinases from the protein data set by values under its
“kin_attr” and the “species” columns. We further row annotate the heat
map with argument annot_rows, which will look for values
under the “kin_class” column. With the vararg, arrange_kin,
we supervise the row ordering of kinases by values under the “kin_order”
column and then those under the “gene” column. Analogous to the
user-supplied filter_ arguments, the row ordering varargs
need to start with arrange_ to indicate the task of row
ordering.
Figure 6B. Heat map visualization of kinase log2FC
See ?standPep for peptide examples.
In this section, we perform the significance analysis of peptide and
protein data. The approach of contrast fit (Chambers, J. M. Linear
models, 1992; Gordon Smyth et al., limma) is taken in
proteoQ. We will first define the contrast groups for significance
tests. For this purpose, I have divided the samples by their WHIM
subtypes, laboratory locations and batch numbers. This ends up with
entries of W2.BI.TMT1, W2.BI.TMT2 etc. under the column “Term” in
expt_smry.xlsx. The interactive environment between the Excel file and
the proteoQ tool allows us to enter more columns of contrasts when
needed. For instance, we might also be interested in a more course
comparison of inter-laboratory differences without batch effects. The
corresponding contrasts of W2.BI, W16.BI etc.
can be found under a pre-made column, “Term_2”. Having have these
columns in hand, we next perform significance tests and data
visualization for peptide and protein data:
# significance tests
pepSig(
W2_bat = ~ Term["W2.BI.TMT2-W2.BI.TMT1",
"W2.JHU.TMT2-W2.JHU.TMT1",
"W2.PNNL.TMT2-W2.PNNL.TMT1"], # batches
W2_loc = ~ Term_2["W2.BI-W2.JHU",
"W2.BI-W2.PNNL",
"W2.JHU-W2.PNNL"], # locations
W16_vs_W2 = ~ Term_3["W16-W2"], # types
)
# formulas matched to pepSig
prnSig()
# volcano plots
pepVol()
prnVol()Note that we have informed the pepSig and
prnSig utility to look for contrasts under columns “Term”,
“Term_2” etc., followed by the contrast pairs in square brackets. Pairs
of contrasts are separated by commas. For more examples, try
?prnSig.
The pepVol and prnVol utility will by
default match the formulas of contrasts with those in
pepSig. The following plots show the batch difference
between two TMT experiments for each of the three laboratories and the
location difference between any two laboratories.
Figure 7A-7B. Volcano plots of protein log2FC. Left: between batches; right: between locations.
In general, the special characters of + and
- in contrast terms need to be avoided in linear modeling.
However, it may be sporadically convenient to use A+B to
denote a combined treatment of both A and B.
In the case, we will put the term(s) containing + or
- into a pair of pointy brackets. The syntax in the
following hypothetical example will compare the effects of
A, B, A+B and the average of
A and B to control C.
# note that <A + B> is one condition whereas (A + B) contains two conditions
prnSig(
fml = ~ Term["A - C", "B - C", "<A + B> - C", "(A + B)/2 - C"],
)Contrast fits require the coefficients of members under each contrast
being summed to zero (see the help of limma). Taking the above as an
example, the coefficients are \(+1\)
and \(-1\) for the two conditions in
the contrast of “\(<\mathbf{A}+\mathbf{B}>-\mathbf{C}\)”,
and \(+1/2\), \(+1/2\) and \(-1\) for the three conditions in “\((\mathbf{A}+\mathbf{B})/2-\mathbf{C}\)”.
Moreover, we need to be aware of the difference between
prnSig(fml = ~ Term["A - C"] and
prnSig(fml = ~ Term["A - C", "B - C"] for the same pair of
contrast “\(\mathbf{A}-\mathbf{C}\)”.
The sample spaces are defined by groups \(\{\mathbf{A, C}\}\) in the former and \(\{\mathbf{A, B, C}\}\) in the latter. Such
difference might affect the underlying sample variance and
subsequently p-values (see also George Casella
and Roger L Berger 2002 ch. 11 for technical details).
In addition to the fixed effects shown above, significance tests with
additive random effects are also supported. More examples can be found
via ?prnSig and Lab 3.3 in the document.
There are a handful of R tools for gene set enrichment
analysis, such as GSEA, GSVA, gage, to name a few. It may be intuitive
as well if we can analyze and visualize the enrichment of gene sets
under the context of volcano plots at given contrasts. Provided the
richness of R utilities in linear modelings, proteoQ takes
a naive approach thereafter to assess the asymmetricity of the
distributions of protein probability \(p\) values (pVals) under volcano plots.
In the analysis of Gene Set Probability Asymmetricity
(GSPA), individual protein pVals from linear modeling are
first taken and separated into the groups of up (\(u\)) or down (\(d\)) expressed proteins within a gene set
(\(s\)). Taking proteins from a \(u\) group as an example, the default is to
calculate the geometric mean of pVals under \(s\) with a penalty-like term:
\[-log_{10}P_u=(-\sum_{i=1}^{n_u}log_{10}p_i-\sum_{j=1}^{m_u}log_{10}p_j)/(n_u+m_u)\]
where \(n_u\) and \(m_u\) are the numbers of entries with pVals smaller or greater than a significance cut-off, respectively, in the \(u\) group.
In practice, large pVals are often less interesting. To simplify the calculation, a representative \(p\) value of 0.1 is applied to the \(m\) insignificant proteins. This is equivalent to counting the number of insignificant entries. Thus, we have
\[-log_{10}P_u=(-\sum_{i=1}^{n_u}log_{10}p_i+m_u)/(n_u+m_u)\]
Repeat the same for the \(d\) group, we have
\[-log_{10}P_d=(-\sum_{i=1}^{n_d}log_{10}p_i+m_d)/(n_d+m_d)\]
The same calculations are iterated through all available gene sets for both the \(u\) and the \(d\) groups:
\[-log_{10}P_{s,u}=(-\sum_{i=1}^{n_{s,u}}log_{10}p_i+m_{s,u})/(n_{s,u}+m_{s,u})\]
\[-log_{10}P_{s,d}=(-\sum_{i=1}^{n_{s,d}}log_{10}p_i+m_{s,d})/(n_{s,d}+m_{s,d})\]
Next, the two \(P\) values are ordered in that \(P_{(1)} \le P_{(2)}\). The quotient, \(P_{\{1\}}/P_{(2)}\) is then taken to represent the bias in enrichment for a given gene set. Obviously the ordering between \(P_{s,u}\) and \(P_{s,d}\) is immaterial. It is just a means to ensue ratios ranging between zero and one with smaller values indicating greater asymmetricity, in accordance to the conventional notation of smaller pVals being more significant.
Alternatively, the asymmetricity in the population of pVals maybe assessed via conventional significance test, i.e. moderated t-test, between the \(u\) and the \(d\) groups (test statistic against \(-log_{10}p\) values of proteins between the two groups). With either method, the corresponding mean(log2FC) are each calculated for the ups and the downs where the difference is used as the fold change of enrichment.
At the input levels, the arguments pval_cutoff and
logFC_cutoff allow us to set aside low impact genes, for
instance, (re)distributing them between the \(n\)-entry significance group and the \(m\)-entry insignificance group. On the
output levels, argument gspval_cutoff sets a threshold in
gene set significance for reporting. More details can be found from the
help document via ?prnGSPA.
We began with the analysis of GSPA against enrichment
terms defined in gene ontology (GO),
molecular signatures (MSig)
and PhosphoSitePlus (PSP):
prnGSPA(
impute_na = FALSE,
pval_cutoff = 5E-2, # protein pVal threshold
logFC_cutoff = log2(1.2), # protein log2FC threshold
gspval_cutoff = 5E-2, # gene-set threshold
gslogFC_cutoff = log2(1.2), # gene-set log2FC threshold
gset_nms = c("go_sets", "c2_msig", "kinsub"),
)The formulas of contrasts will by default match to the those used in
pepSig. The species will be determined automatically from
input data and the corresponding databases will be loaded. In the above
example of PDX, databases of GO, MSig and
KinSub will be loaded for both human and mouse. If we
choose to focus on human proteins, we can add a vararg statement such as
filter_sp = exprs(species == "human").
We next visualize the distribution of protein log2FC and pVal within gene sets:
gspaMap(
gspval_cutoff = 5E-3,
gslogFC_cutoff = log2(1.2),
# gset_nms = c("go_sets"),
show_sig = pVal,
xco = 1.2, # position of two vertical lines for FC
yco = 0.05, # position of a horizental line for pVal
)This will produce the volcano plots of proteins under gene sets that have passed our selection criteria. Here, we show one of the examples:
Figure 8A. An example of volcano plots of protein log2FC under a gene set. Top, method = mean; bottom, method = limma.
In general, any data bases for any species finished at a format
similar to the go_hs.rds shown below are applicable for
prnGSPA analysis and subsequent gspaMap
visualization. I will deliberate the examples of custom GO
and MSig.
The gene sets of GO and MSig are available
for species human, mouse and rat in proteoQ. For custom gene sets and/or
additional species, the utility prepGO will download and
prepare GO data according to custom-supplied URLs. In the
follow examples, we prepare the GO data of
go_hs.rds and go_mm.rds for human
and mouse, respectively, under the file folder
~\\proteoQ\\dbs\\go:
prepGO(
species = human,
db_path = "~/proteoQ/dbs/go",
gaf_url = "http://current.geneontology.org/annotations/goa_human.gaf.gz",
obo_url = "http://purl.obolibrary.org/obo/go/go-basic.obo",
filename = go_hs.rds,
)
prepGO(
species = mouse,
db_path = "~/proteoQ/dbs/go",
gaf_url = "http://current.geneontology.org/annotations/mgi.gaf.gz",
obo_url = "http://purl.obolibrary.org/obo/go/go-basic.obo",
filename = go_mm.rds,
)
# head(readRDS(file.path("~/proteoQ/dbs/go", "go_hs.rds")))
# head(readRDS(file.path("~/proteoQ/dbs/go", "go_mm.rds")))Similarly, we prepare custom MSig data bases for
human and mouse:
prepMSig(
# msig_url = "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.0/c2.all.v7.0.entrez.gmt",
# db_path = "~/proteoQ/dbs/msig",
species = human,
filename = msig_hs.rds,
)
prepMSig(
# msig_url = "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.0/c2.all.v7.0.entrez.gmt",
# ortho_mart = mmusculus_gene_ensembl,
# db_path = "~/proteoQ/dbs/msig",
species = mouse,
filename = msig_mm.rds,
)
# head(readRDS(file.path("~/proteoQ/dbs/msig", "msig_hs.rds")))
# head(readRDS(file.path("~/proteoQ/dbs/msig", "msig_mm.rds")))We need to provide the list name of ortho_mart for
species other than human, mouse and rat. The value will be used for
ortholog lookups via biomaRt.
More details are available in the help document via
?prepMSig. Note that the data bases will be stored as
.rds files:
# start over
unlink(file.path(dat_dir, "Protein/GSPA"), recursive = TRUE, force = TRUE)
prnGSPA(
impute_na = FALSE,
pval_cutoff = 5E-2,
logFC_cutoff = log2(1.2),
gspval_cutoff = 5E-2,
gslogFC_cutoff = log2(1.2),
gset_nms = c("~/proteoQ/dbs/go/go_hs.rds",
"~/proteoQ/dbs/go/go_mm.rds",
"~/proteoQ/dbs/msig/msig_hs.rds",
"~/proteoQ/dbs/msig/msig_mm.rds"),
)
gspaMap(
gset_nms = c("~/proteoQ/dbs/go/go_hs.rds",
"~/proteoQ/dbs/go/go_mm.rds",
"~/proteoQ/dbs/msig/msig_hs.rds",
"~/proteoQ/dbs/msig/msig_mm.rds"),
impute_na = FALSE,
topn_labels = 0,
gspval_cutoff = 5E-2,
gslogFC_cutoff = log2(1.2),
show_sig = pVal,
xco = 1.2,
yco = 0.05,
)As expected, in the examples of MSig, some breast cancer
signatures in basal and luminal subtypes were captured.
Figure 8B. Examples of volcano plots of protein log2FC under molecular signatures.
Currently, proteoQ does not keep track of the values of
gset_nms in the various calls to prnGSPA. When
mapping the findings from prnGSPA to gspaMap,
we need to be responsible for the completeness of the gene-set
space. If we were to leave out the setting of
gset_nms, the default of
gset_nms = c("go_sets", "c2_msig", "kinsub") will be
applied when executing gspaMap. We might thus encounter
some discrepancies in the volcano plots of GO terms due to probable
differences between the default and the custom data bases.
For simplicity, it is generally applicable to include all the data
bases that have been applied to prnGSPA in a custom
workflow and, in that way, no terms will be missed for visualization
with gspaMap. This is also suitable in that
gspaMap merely perform volcano plot visualization by gene
sets and no multiple-test
correlations are involved. In other words, there is no penalty for
including more gene sets than necessary, other than some negligible
costs of computer memory and time.
In addition to finding gene sets that are biased in pVal
distributions, prnGSPA reports the essential gene sets
using a greedy set cover algorithm. The correspondence between essential
and all of the gene sets are stored in _essmap.txt files
under the Protein/GSPA folder.
The utility in proteoQ for conventional GSEA analysis is
prnGSEA(). Gene set variance analysis (GSVA) is available
through prnGSVA. Details can be found via
?prnGSEA and ?prnGSVA, respectively, from an
R console.
In the above section, we have plotted the enrichment of gene sets by individual GO or KEGG terms. Depending on how much the sample groups contrast to each other, we could have produced more plots where many of them might never get viewed. Besides, gene sets can be redundant with overlaps to one another to varying degrees. A means to communicate the gene set results at high levels is to present them as hierarchical trees or grouped networks.
In this section, we will visualize the connectivity of significant
gene sets by both distance heat maps and networks. For simplicity, the
heat maps or networks will be constructed only between gene sets and
essential gene sets. As mentioned in section
Gene sets under volcano plots, the essential gene sets were
approximated with greedy set cover. This will reduce the dimensionality
of data from \(n \times n\) to \(n \times m\) (\(m
\le n\)).
We next gauge the redundancy of a gene set in relative to an
essential set by counting the numbers of intersecting gene IDs. This is
documented as the fraction of overlap between gene sets when calling
prnGSPA. The values are available in output files such as
Protein/GSPA/essmap_.*.csv. For network visualization, the
gene sets are further classified by their distance using hierarchical
clustering.
In this following, we first perform simple heat map visualization between all significant gene sets in columns and essential groups in rows.
The distance in heat is \(D = 1-f\)
where \(f\) is the fraction of overlap
in IDs between two gene sets. The smaller the distance, the greater the
overlap is between two gene sets. For convenience, a “distance” column
is also made available in the _essmap.txt file.
Figure 8C. Heat map visualization of the distance between all and essential gene sets. The contrasts are defined in ‘prnSig(W2_loc = )’ in section 2.4 Significance tests and volcano plot visualization
As expected, we saw zero overlap between human and mouse gene sets.
Within each organism, low-redundancy red cells overwhelm
the heat map and might have impeded us from capturing high-redundancy
terms in blue. We can, however, readily de-emphasize the
red cells by data filtration. In the example shown below,
we chose to keep more redundant terms at distances shorter than or equal
to 0.33:
prnGSPAHM(
filter2_by = rlang::exprs(distance <= .33),
filter2_sp = rlang::exprs(start_with_str("hs", term)),
annot_cols = "ess_idx",
annot_colnames = "Eset index",
annot_rows = "ess_size",
filename = show_human_redundancy.png,
)Note that there is a second vararg expression,
exprs(start_with_str("hs", term)). In this expression, we
have used a pseudonym approach to subset terms starting with character
string hs under the column term in
GSPA result files, which corresponds to human gene sets for
both GO and KEGG.8 More examples of the pseudonym approach can
be found from Lab 3.2 in
this document. More examples of the utility can be found via
?prnGSPAHM.
Figure 8D. Heat map visualization of human gene sets at a distance cut-off 0.2
Aside from heat maps, prnGSPAHM produces the networks of
gene sets via networkD3,
for interactive exploration of gene set redundancy.
Figure 8E. Snapshots of the networks of biological terms. Left, distance <= 0.8; right, distance <= 0.2.
In this section, we perform the trend analysis against protein
expressions. It wraps around various R routines in
unsupervised clustering of data.
The proteoQ utility for the clustering of protein log2FC is
anal_prnTrend. Note that the number of clusters is provided
by n_clust, which can be a single value or a vector of
integers.
The above codes will generate result files,
Protein_Trend_Z_nclust5.txt and
Protein_Trend_Z_nclust6.txt, under the
.../Protein/Trend directory. The letter Z in
the file names remind us that the results were derived from normalized
protein data with the option of scale_log2r = TRUE. More
details are available via ?anal_prnTrend from a R
section.
We next visualize the results:
The argument col_order provides a means to supervise the
order of samples during the trend visualization. In the above example,
the plot_prnTrend will look into the field under the
“Order” column in expt_smry.xlsx for sample arrangement (see also
Section 2.3 Correlation plots).
Figure 9A. Trends of protein log2FC (n_clust = 6).
We can subset the secondary input data by filter2_
varargs. In the example shown below, we choose to visualize only the
pattern of trends in cluster 4. Note that cluster is a
column key in Protein_Trend_[...].txt:
plot_prnTrend(
col_order = Order,
filter2_by = rlang::exprs(cluster == 4),
width = 12,
height = 12,
filename = cl4.png,
)Figure 9B. Trends of protein log2FC at cluster 4 (n_clust = 6).
We can also select certain sample groups for visualization, for instance, the samples under the column “BI” in expt_smry.xlsx:
Figure 9C. Trends of protein log2FC for BI subset (n_clust = 6).
Note the difference between
and
Apparently, they will both plot the trends of protein log2FC for the
BI subset. In spite, the former is based on the clustering
results from the BI subset whereas the later is based on
the findings from all samples. The same consideration will typically
hold for various informatic analysis in proteoQ, including the NMF
analysis that we will next discuss.
The trend findings from anal_prnTrend can be loaded
automatically to the ClueGO
utility in Cytoscape. The
installation of yFiles Layout
Algorithms is also required.
# Make sure that Cytoscape is open
cluego(
df2 = Protein_Trend_Z_nclust5.txt,
species = c(human = "Homo Sapiens"),
n_clust = c(3, 5)
)Note that human is a value that can be found under the
column species in Protein_Trend_Z_nclust5.txt
and Homo Sapiens is the corresponding name used in
ClueGO.
In this section, we will performs the analysis of non-negative matrix
factorization (NMF) against protein data. More details can be found from
NMF
and the ?anal_prnNMF wrapper. Since additional arguments
can be passed on to NMF, we will test below protein classifications with
both the default and the ‘lee’ method:
# load library
library(NMF)
# NMF analysis
anal_prnNMF(
impute_na = FALSE,
col_group = Group, # optional a priori knowledge of sample groups
r = c(5:6),
nrun = 20,
seed = 123,
filter_by_npep = rlang::exprs(prot_n_pep >= 2),
)
anal_prnNMF(
impute_na = FALSE,
col_group = Group,
method = "lee",
r = c(5:6),
nrun = 20,
seed = 123,
filter_by_npep = rlang::exprs(prot_n_pep >= 2),
filename = lee.txt,
)Analogous analysis for peptide data are available via
anal_pepNMF(...).
Following the primary NMF analysis, secondary utilities of
plot_pepNMFCon and plot_prnNMFCon prepare the
consensus heat maps of peptide and protein data, respectively.
Similarly, plot_pepNMFCoef and plot_prnNMFCoef
prepare coefficient heat maps. Utility plot_metaNMF makes
the heat maps of protein log2FC. These utilities can pass arguments to
pheatmap as shown in Section 2.3. In the
examples shown below, we plot the heat maps for protein data against all
available ranks, which are 5 and 6, specified previously in the
anal_prnNMF step.
plot_prnNMFCon(
impute_na = FALSE,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
width = 14,
height = 14,
)
plot_prnNMFCoef(
impute_na = FALSE,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
width = 14,
height = 6,
)
plot_metaNMF(
impute_na = FALSE,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
cell_width = 6,
fontsize = 6,
fontsize_col = 5,
)Argument impute_na reminds us which piece(s) of NMF
results from the corresponding anal_[...]NMF will be used
for plotting. The same is true for scale_log2r, which
defaults at TRUE. An error message will be noted if no corresponding
analysis results were found.
Visualization against data subset is also feasible. In the next
example, we will prepare heat maps for samples under column “BI” in
expt_smry.xlsx. We further limit ourselves to results from
anal_prnNMF at r = 5. In metagene plots, we choose
additionally to row order data by genes via the arrange_
vararg:
plot_prnNMFCon(
impute_na = FALSE,
col_select = BI,
r = 5,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
fontsize = 8,
fontsize_col = 6,
fontsize_row = 6,
width = 6.5,
height = 6,
filename = bi_r5_con.png,
)
plot_prnNMFCoef(
impute_na = FALSE,
col_select = BI,
r = 5,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
fontsize = 8,
fontsize_col = 6,
fontsize_row = 6,
width = 12,
height = 3,
filename = bi_r5_coef.png,
)
plot_metaNMF(
impute_na = FALSE,
col_select = BI,
r = 5,
annot_cols = c("Color", "Alpha", "Shape"),
annot_colnames = c("Lab", "Batch", "WHIM"),
# fontsize = 5,
# fontsize_col = 5,
# cellwidth = 6,
# cellheight = 6,
cluster_rows = FALSE,
arrange_by = rlang::exprs(gene),
filename = bi_r5_rowordered.png,
)The silhouette information was obtained via the R package
cluster and shown as a track on the top of consensus and
coefficient heat maps.
Figure 10A-10B. Heat map visualization of protein NMF results with default method (results from method = “lee” not shown). Left: concensus; right: coefficients; metagenes not shown.
While utility plot_prnTrend in trend visualization
(Section 2.7) can take a customized theme for uses in
ggplot2
therein, the plot_ functions in NMF are wrappers of pheatmap
and thus can process a user-supplied color palette.
The following performs the STRING analysis of
protein-protein interactions. More details can be found from
?anal_prnString and ?prepString.
anal_prnString(
db_path = "~/proteoQ/dbs/string",
db_nms = c("~/proteoQ/dbs/string/string_hs.rds",
"~/proteoQ/dbs/string/string_mm.rds"),
score_cutoff = .9,
filter_by_sp = rlang::exprs(species %in% c("human", "mouse")),
filter_prots_by = rlang::exprs(prot_n_pep >= 2),
)The results of protein-protein interaction is summarised in
Protein_String_[...]_ppi.tsv and the expression data in
Protein_String_[...]_expr.tsv. The files are formatted for
direct applications with Cytoscape. When calling
anal_prnString, the corresponding databases will be
downloaded automatically if not yet present locally. One can also choose
to download separately the databases for a given
species:
prepString(
species = this_mouse,
links_url = "https://stringdb-static.org/download/protein.links.full.v11.0/10090.protein.links.full.v11.0.txt.gz",
aliases_url = "https://stringdb-static.org/download/protein.aliases.v11.0/10090.protein.aliases.v11.0.txt.gz",
info_url = "https://stringdb-static.org/download/protein.info.v11.0/10090.protein.info.v11.0.txt.gz",
filename = my_mm.rds,
)Imputation of peptide and protein data are handle with
pepImp and prnImp. More information can be
found from mice
and ?prnImp.
In this lab, we explore the effects of reference choices on data normalization and cleanup.
We first copy data over to the file directory specified by
temp_dir, followed by PSM, peptide normalization and
histogram visualization of peptide log2FC.
# exemplary data
temp_dir <- "~/proteoQ/ref_w2"
library(proteoQDA)
copy_mascot_gtmt(temp_dir)
copy_exptsmry_w2ref(temp_dir)
copy_fracsmry_gtmt(temp_dir)
# analysis
library(proteoQ)
load_expts(temp_dir, expt_smry_ref_w2.xlsx)
normPSM(
group_psm_by = pep_seq,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
rptr_intco = 1000,
rm_craps = TRUE,
rm_krts = FALSE,
rm_outliers = FALSE,
annot_kinases = TRUE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
filter_peps = rlang::exprs(pep_expect <= .1),
)
PSM2Pep()
mergePep()
standPep()
pepHist(
scale_log2r = FALSE,
ncol = 9,
)Notice that in the histograms the log2FC profiles of
WHIM16 samples are much narrower than those of
WHIM2 (Figure S1A). This will occur when a
reference is more similar to one group of sample(s) than the other. In
our case, the reference is one of WHIM2. The difference in
the breadth of log2FC profiles between the WHIM16 and the
WHIM2 groups is likely due to the genuine difference in
their proteomes. If the above argument is valid, a scaling normalize
would moderate, and thus bias, the quantitative difference in proteomes
between WHIM2 and WHIM16.
Figure S1A. Histograms of peptide log2FC with a WHIM2 reference.
We may alternatively seek a “center-of-mass” representation for uses
as references where each sample may be regarded as a weighted particle
in the center construction (see George Casella
and Roger L Berger 2002 ch. 5.2 & ch 11.3.5 for some quantitative
descriptions about data centers). We select one
WHIM2 and one WHIM16 from each 10-plex TMT.
The proteoQ tool will average the signals from designated references.
Therefore, the derived reference can be viewed as a mid point of the
WHIM2 and the WHIM16 proteomes.
We next perform analogously the data summary and histogram visualization.
temp_dir_w2w16 <- "~/proteoQ/ref_w2w16"
library(proteoQDA)
copy_mascot_gtmt(temp_dir_w2w16)
copy_exptsmry_w2w16ref(temp_dir_w2w16)
copy_fracsmry_gtmt(temp_dir_w2w16)
library(proteoQ)
load_expts(temp_dir_w2w16, expt_smry_ref_w2_w16.xlsx)
normPSM(
group_psm_by = pep_seq,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
rptr_intco = 1000,
rm_craps = TRUE,
rm_krts = FALSE,
rm_outliers = FALSE,
annot_kinases = TRUE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
filter_peps = rlang::exprs(pep_expect <= .1),
)
PSM2Pep()
mergePep()
standPep()
pepHist(
scale_log2r = FALSE,
ncol = 8,
)With the new reference, we have achieved log2FC profiles that are
more comparable in breadth between WHIM2 and
WHIM16 samples and a subsequent scaling normalization seems
more suitable.
Figure S1B. Histograms of peptide log2FC with a combined WHIM2 and WHIM16 reference.
In this section, we explore the effects of reference choices on the
CV of log2FC. For simplicity, we will visualize the peptide data that
link to the BI subset at batch number one. We first add a
new column, let’s say “BI_1”, in expt_smry_ref_w2.xlsx with the
corresponding samples being indicated. We next display the distributions
of proteins CV measured from contributing peptides before data removals
(Figure S1C):
# continue on the `ref_w2` example in section 3.1.1
library(proteoQ)
load_expts("~/proteoQ/ref_w2", expt_smry_ref_w2.xlsx, frac_smry.xlsx)
# `BI_1` subset for visualization
purgePep(
col_select = BI_1,
ymax = 1.2,
ybreaks = .5,
width = 8,
height = 8,
flip_coord = TRUE,
filename = bi1.png,
)Notice that the CV distributions of WHIM2 are much
narrower than those of WHIM16. This makes intuitive sense
given that the log2FC profiles of WHIM2 are much narrows as
well (Figure S1A). To discount the genuine difference
in sample CV, we next trim relatively the data points by
percentiles:
purgePep(
col_select = BI_1,
pt_cv = .95,
ymax = 1.2,
ybreaks = .5,
width = 8,
height = 8,
flip_coord = TRUE,
filename = bi1_ptcv.png,
)Figure S1C-S1D. Protein CV from peptide measures with WHIM2 reference. Left: before trimming; right: after trimming.
The row filtrations and column additions of data are both available in proteoQ.
In this lab, we will first apply pseudoname approaches to subset data. The available pseudonyms include
contain_str: contain a literal string; “PEPTIDES”
contain_str “TIDE”.contain_chars_in: contain some of the characters in a
literal string; “PEPTIDES” contain_chars_in “XP”.not_contain_str: not contain a literal string;
“PEPTIDES” not_contain_str “TED”.not_contain_chars_in: not contain any of the characters
in a literal string; “PEPTIDES” not_contain_chars_in “CAB”.start_with_str: start with a literal string. “PEPTIDES”
start_with_str “PEP”.end_with_str: end with a literal string. “PEPTIDES”
end_with_str “TIDES”.start_with_chars_in: start with one of the characters
in a literal string. “PEPTIDES” start_with_chars_in “XP”.ends_with_chars_in: end with one of the characters in a
literal string. “PEPTIDES” ends_with_chars_in “XS”.These functions are typically coupled to the varargs of
filter_ or slice_ for the subsetting of data
rows based on their names. More information can be found from the help
document via ?contain_str. In the following example, we
will apply contain_chars_in to subset peptide data.
The CPTAC publication contains both global and phosphopeptide data
from the same samples. This allows us to explore the stoichiometry of
phosphopeptide subsets in relative to the combined data sets of
global + phospho peptides. We first copy over both the
global and the phospho data sets to the file directory specified by
dat_dir, followed by PSM, peptide normalization and
histogram visualization of peptide log2FC of the BI_1
subset.
# exemplary data
dat_dir <- "~/proteoQ/phospho_stoichiometry"
# dir.create(dat_dir, recursive = TRUE, showWarnings = FALSE)
library(proteoQDA)
copy_mascot_gtmt(dat_dir)
copy_mascot_ptmt(dat_dir)
copy_exptsmry_cmbn(dat_dir)
copy_fracsmry_cmbn(dat_dir)
# analysis
library(proteoQ)
load_expts()
# note that `group_psm_by = pep_seq_mod`
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
filter_peps = rlang::exprs(pep_expect <= .1),
)
PSM2Pep()
mergePep()
standPep(
method_align = MGKernel,
range_log2r = c(10, 95),
range_int = c(5, 95),
n_comp = 3,
seed = 883,
maxit = 200,
epsilon = 1e-05,
)
# (a) phospho subsets without y-scaling
pepHist(
col_select = BI_1,
filter_peps = rlang::exprs(contain_chars_in("sty", pep_seq_mod)),
scale_y = FALSE,
ncol = 5,
filename = pSTY_bi1_scaley_no.png,
)
# (b) phospho subsets with y-scaling
pepHist(
col_select = BI_1,
filter_peps = rlang::exprs(contain_chars_in("sty", pep_seq_mod)),
scale_y = TRUE,
ncol = 5,
filename = pSTY_bi1_scaley_yes.png,
)Note that we have applied the new grammar of
contain_chars_in("sty", pep_seq_mod) to extract character
strings containing lower-case letters ‘s’, ‘t’ or ‘y’ under the
pep_seq_mod column in Peptide.txt. This
corresponds to the subsettting of peptides with phosphorylation(s) in
serine, thereonine or tyrosine.9
Figure S2A-S2B. Histograms of log2FC. Left: phosphopeptides without y-axix scaling; right: phosphopeptides with y-axix scaling. The density curves are from the combined data of global + phospho.
Ideally, the profiles of the log2FC between the phospho subsets and the overall data would either align at the maximum density or perhaps offset by similar distance among replicated samples. In this example, the alignment at maximum density seems to be the case. The observation raises the possibility of measuring the stoichiometry of phosphoproteomes in relative to global data across sample types or conditions.
In addition to pseudonyms, convenience columns such as
pep_mod_protntac and pep_mod_sty are made
available in Peptide.txt, to indicate the property of
peptide modifications of protein N-terminal acetylation and
phosphorylation, respectively. We can use alternatively the column keys
to subset data, for example, extracting peptides from N-terminal
acetylated proteins:
# (c) N-term acetylation subsets without y-scaling
pepHist(
col_select = BI_1,
scale_log2r = TRUE,
filter_peps = rlang::exprs(pep_mod_protntac == TRUE),
scale_y = FALSE,
ncol = 5,
filename = bi1_nac_scaley_no.png,
)
# (d) N-term acetylation subsets with y-scaling
pepHist(
col_select = BI_1,
scale_log2r = TRUE,
filter_peps = rlang::exprs(pep_mod_protntac),
scale_y = TRUE,
ncol = 5,
filename = bi1_nac_scaley_yes.png,
)Figure S2C-S2D. Histograms of the log2FC of peptides from N-terminal acetylated proteins. Left: without y-axix scaling; right: with y-axix scaling.
Pseudonyms and convenience columns can be used interexchangeably for simple conditions. In the following example, we assume that peptide sequences are under the column “pep_seq_mod” in Peptide.txt with variably modified residues in lower case. we can exclude oxidized methionine or deamidated asparagine from uses in data normalization:
Pep2Prn(
filter_by_mn = rlang::exprs(not_contain_chars_in("mn", pep_seq_mod)),
)
standPrn(
method_align = MGKernel,
range_log2r = c(5, 95),
range_int = c(5, 95),
n_comp = 2,
seed = 749662,
maxit = 200,
epsilon = 1e-05,
)
prnHist(
col_select = BI_1,
scale_log2r = TRUE,
scale_y = FALSE,
ncol = 5,
filter_prns_by = rlang::exprs(species == "mouse"),
filename = "bi1_nac_scaley_no.png",
)
prnHist(
col_select = BI_1,
scale_log2r = TRUE,
scale_y = TRUE,
ncol = 5,
filter_prns_by = rlang::exprs(species == "mouse"),
filename = "bi1_nac_scaley_yes.png",
)or use alternatively the convenience columns, pep_mod_m
and pep_mod_n, for the same purpose:
Customer supplied columns can be further taken by proteoQ for various
data processing and informatic analyses. In this section, we will first
add a column, n_not_na, to protein table
Protein.txt. The column summarizes the number of log2FCs
that are NOT missing for each protein. The newly added column
will then be applied to data-row filtration during heat map
visualization.
# add a column to "Protein.txt"
df <- readr::read_tsv(file.path(dat_dir, "Protein/Protein.txt"))
library(magrittr)
n_not_na <- df %>%
dplyr::select(grep("Z_log2_R", names(.))) %>%
dplyr::select(-grep("\\(Ref|\\(Empty", names(.))) %>%
is.na() %>%
`!`() %>%
rowSums()
df %>%
dplyr::mutate(n_not_na = n_not_na) %>%
# proteoQ::reorderCols2() %>%
readr::write_tsv(file.path(dat_dir, "Protein/Protein.txt"))Note that there is a restriction in column additions in that the custom column(s) need to be anchored before the intensity and ratio fields for uses in downstream analyses. This is achieved behind the scene when the modified file is loaded, for example, in protein heat map visualization:
prnHM(
df = "Protein/Protein.txt",
xmin = -1,
xmax = 1,
xmargin = 0.1,
annot_cols = c("Group", "Color", "Alpha", "Shape"),
annot_colnames = c("Group", "Lab", "Batch", "WHIM"),
cluster_rows = TRUE,
cutree_rows = 10,
show_rownames = FALSE,
show_colnames = TRUE,
fontsize_row = 3,
cellwidth = 14,
width = 18,
height = 12,
filter_prots_by_sp_npep = rlang::exprs(n_not_na <= 10),
filename = "mostly_na_vals.png",
)Importantly, we need to supply the file name to argument
df. This is because a higher precedence will be given to
Model/Protein_pVals.txt over Protein.txt.
Without specifying the value of df, proteoQ will look for
the n_not_na column that are indeed absent from
Protein_pVals.txt.
Figure S2E. Scarce heat map.
Alternatively, we may add the custom column to
Protein_pVals.txt:
df <- readr::read_tsv(file.path(dat_dir, "Protein/Model/Protein_pVals.txt"))
n_not_na <- df %>%
dplyr::select(grep("Z_log2_R", names(.))) %>%
dplyr::select(-grep("\\(Ref|\\(Empty", names(.))) %>%
is.na() %>%
`!`() %>%
rowSums()
df %>%
dplyr::mutate(na_counts = na_counts) %>%
readr::write_tsv(file.path(dat_dir, "Protein/Model/Protein_pVals.txt"))
prnHM(
df = "Protein/Model/Protein_pVals.txt",
xmin = -1,
xmax = 1,
xmargin = 0.1,
annot_cols = c("Group", "Color", "Alpha", "Shape"),
annot_colnames = c("Group", "Lab", "Batch", "WHIM"),
cluster_rows = TRUE,
cutree_rows = 10,
show_rownames = FALSE,
show_colnames = TRUE,
fontsize_row = 3,
cellwidth = 14,
width = 18,
height = 12,
filter_prots_by_sp_npep = rlang::exprs(n_not_na <= 10),
filename = "na30.png",
)Models that incorporate both fixed- and random-effects terms in a linear predictor expression are often termed mixed effects models.
In proteomic studies involved multiple multiplex TMT
experiments, the limited multiplicity of isobaric tags requires sample
parting into subgroups. Measures in log2FC are then obtained within each
subgroup by comparing to common reference materials, followed by data
bridging across experiments. This setup violates the independence
assumption in statistical sampling as the measures of log2FC are batched
by TMT experiments. In this lab, we will use the CPTAC data to test the
statistical significance in protein abundance between the WHIM2 and the
WHIM16 subtypes, by first taking the batch effects into account. We will
use mixed-effects models to explore the random effects that were
introduced by the data stitching. In case that you would like to find
out more about mixed-effects models in R, I found the online tutorial
a helpful resource.
We start off by (re)executing the reduced example shown in
?load_expts:
dat_dir <- "~/proteoQ/randeffs/examples"
library(proteoQDA)
copy_exptsmry_gtmt(dat_dir)
copy_fracsmry_gtmt(dat_dir)
copy_mascot_gtmt(dat_dir)
library(proteoQ)
load_expts()
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
annot_kinases = TRUE,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
)
PSM2Pep()
mergePep()
standPep()
pepHist()
Pep2Prn(use_unique_pep = TRUE)
standPrn()
prnHist()We next carry out the significance tests with and without random effects:
pepSig(
impute_na = FALSE,
W2_vs_W16_fix = ~ Term_3["W16-W2"], # fixed effect only
W2_vs_W16_mix = ~ Term_3["W16-W2"] + (1|TMT_Set), # one fixed and one random effects
)
prnSig(impute_na = FALSE)
# volcano plots
prnVol(impute_na = FALSE)In the formula linked to argument W2_vs_W16_mix, the
random effect (1|TMT_Set) is an addition to the fix effect
Term_3["W16-W2"]. The syntax (1|TMT_Set)
indicates the TMT_Set term to be parsed as a random effect.
The name of the term is again a column key in
expt_smry.xlsx. In this example, the TMT
batches are documented under the column TMT_Set and can be
applied directly to our formula.
Upon the completion of the protein significance tests, we can analyze
analogously the gene set enrichment against these new formulas by
calling functions prnGSPA, gspaMAP and
prnGSPAHM. This results will contain random effects in
enrichment analysis against gene sets.
In this section, we will test the statistical significance in protein
abundance changes between the WHIM2 and the
WHIM16 subtypes, by taking additively both the TMT batch
effects and the laboratory effects into account. At the time of writing
the document, I don’t yet know how to handle multiple random effects
using limma. Alternatively, I use lmerTest to
do the work.
Missing values can frequently fail random-effects modeling with more complex error structures and need additional cares. One workaround is to simply restrict ourselves to entries that are complete in cases. This would lead to a number of proteins not measurable in their statistical significance. Alternatively, we may seek to fill in missing values using techniques such as multivariate imputation.
We further note that the laboratory differences are coded under
columns Color in expt_smry.xlsx. We then test the
statistical difference between WHIM2 and
WHIM16 against the following three models:
# impute NA
pepImp(m = 2, maxit = 2)
prnImp(m = 5, maxit = 5)
# significance tests
# really take a while; need to expedite `lm` as mentioned in "Advanced R" (Hadley Wichham, Ch. 24)
pepSig(
impute_na = TRUE, # otherwise coerce to complete cases at multiple random effects
method = lm,
W2_vs_W16_fix = ~ Term_3["W16-W2"], # one fixed effect
W2_vs_W16_mix = ~ Term_3["W16-W2"] + (1|TMT_Set), # one fixed and one random effect
W2_vs_W16_mix_2 = ~ Term_3["W16-W2"] + (1|TMT_Set) + (1|Color), # one fixed and two random effects
)
prnSig(
impute_na = TRUE, # otherwise coerce to complete cases at multiple random effects
method = lm,
W2_vs_W16_fix = ~ Term_3["W16-W2"], # one fixed effect
W2_vs_W16_mix = ~ Term_3["W16-W2"] + (1|TMT_Set), # one fixed and one random effect
W2_vs_W16_mix_2 = ~ Term_3["W16-W2"] + (1|TMT_Set) + (1|Color), # one fixed and two random effects
)
# correlation plots
read.csv(file.path(dat_dir, "Protein/Model/Protein_pVals.txt"),
check.names = FALSE, header = TRUE, sep = "\t") %>%
dplyr::select(grep("pVal\\s+", names(.))) %>%
`colnames<-`(c("none", "one", "two")) %>%
dplyr::mutate_all(~ -log10(.x)) %>%
GGally::ggpairs(columnLabels = as.character(names(.)), labeller = label_wrap_gen(10), title = "",
xlab = expression("pVal ("*-log[10]*")"), ylab = expression("pVal ("*-log[10]*")")) The correlation plots indicate that the random effects of batches and
laboratory locations are much smaller than the fixed effect of the
biological differences of WHIM2 and
WHIM16.
Figure S3. Pearson r of protein significance p-values.
The results are reported at the levels of PSMs, peptides and proteins. The order of column keys may vary slightly provided differences in databases, accession types, and/or function parameters.
PSMs are reported at the basis of per TMT experiment per series of
LC/MS data acquisition. The names of the result files are
TMTset1_LCMSinj1_PSM_N.txt,
TMTset2_LCMSinj1_PSM_N.txt etc. with the indexes of TMT
experiment and LC/MS injection index being indicated . The column keys
are described in Matrix Science,
MaxQuant,
MSFragger or
the users’ manual of Spectrum Mill, with the following additions or
modifications:
| Header | Descrption | Note |
|---|---|---|
| prot_hit_num | Ordinal number of the protein hit | |
| prot_family_member | Ordinal number of the protein family member | |
| prot_acc | Protein accession string | MaxQuant, the leading entry in
Proteins. |
| prot_desc | Protein description taken from Fasta title line | |
| prot_score | Mascot protein score | |
| prot_mass | Protein mass | |
| prot_matches | Mascot count of PSMs | Before data joining. |
| prot_matches_sig | Mascot count of PSMs that have significant scores under a proposed protein | v.s. |
| prot_sequences | Mascot count of distinct sequences | v.s. |
| prot_sequences_sig | Mascot count of distinct sequences that have significant scores under a proposed protein | v.s. |
| prot_n_psm | Count of significant PSMs in quantitation under a proposed protein | By each TMT (LFQ) experiment and/or LC/MS series; the counts exclude entries that are void in reporter-ion intensity across all channels or filtered by users. |
| prot_n_pep | Count of significant peptide sequences in quantitation under a proposed protein | v.s.; at group_psm_by = pep_seq_mod, the
same peptide with different variable modifications will be counted as
different sequences. |
| prot_len | Number of amino acid residues under a proposed protein | |
| prot_pi | Mascot calculation of protein isoelectric point | |
| prot_tax_str | Mascot protein taxonomy as string | |
| prot_tax_id | Mascot protein taxonomy as Tax ID | |
| prot_seq | Mascot protein sequence in 1 letter code | |
| prot_empai | Mascot empirical protein abundance index | |
| prot_icover | Protein sequence coverage by counts of tryptic peptides | Number of observed sequences divided by number of possible sequences (only applied to tryptic sequences). |
| prot_cover | Protein sequence coverage | Calculated from the union of individual data sources. |
| prot_issig | Protein identification is significant or not | |
| prot_isess | Protein identification is essential or not | |
| prot_tier | Protein tier | |
| prot_es | Protein enrichment score | |
| prot_es_co | Protein enrichment score cut-off | |
| pep_query | Mascot ordinal number of query after sorting by Mr | |
| pep_rank | Mascot peptide sequence match (PSM) rank. If two PSMs have same score they have the same rank. | |
| pep_n_psm | Count of significant PSMs in quantitation under a proposed peptide | See also prot_n_psm. |
| pep_isbold | Mascot: if grouping enabled, then a significant PSM. Otherwise, indicates this is the highest scoring protein that contains a match to this query. | |
| pep_isunique | Peptide sequence is unique or not. | Floating indicator at the levels of protein groups,
protein IDs or none, according to
normPSM(pep_unique_by = ...). |
| pep_literal_unique | Peptide sequence is unique to hit or not. | Mascot: need enabled same-set and sub-set proteins during PSM exports. |
| pep_razor_unique | Peptide sequence is unique to group or not. | v.s. |
| pep_tot_int | Total MS1 intenisty of a peptide match | Mascot: need enabled
Raw peptide match data in PSM exports; also
MS/MS peak list for pep_ms2_sumint and
pep_n_ions. |
| pep_unique_int | Unique MS1 intenisty of a peptide match | v.s. |
| pep_razor_int | Razor MS1 intenisty of a peptide match | v.s. |
| pep_exp_mz | Experimental m/z value | |
| pep_exp_mr | Molecular mass calculated from experimental m/z value | |
| pep_exp_z | Experimental charge state | |
| pep_n_exp_z | Number of unique charge states under a peptide sequence | The levels of peptide uniqueness are according to
group_psm_by. |
| pep_calc_mr | Molecular mass calculated from matched peptide sequence | |
| pep_delta | pep_exp_mr - pep_calc_mr | |
| pep_score | Score of PSM | MaxQuant, Score. MSFragger,
Hyperscore. Spectrum Mill, score. |
| pep_homol | Mascot homology threshold score for PSM | |
| pep_ident | Mascot identity threshold score for PSM | |
| pep_expect | Expectation value or posterior error probability of PSM | MaxQuant, PEP. MSFragger,
Expectation. |
| pep_res_before | Flanking residue on N-term side of peptide | |
| pep_seq | One-letter representation of peptide sequences without variable modifications | |
| pep_seq_mod | pep_seq with variable modifications |
E.g._mAsGVAVSDGVIK with a methionine
oxidation and a serine phosphorylation. The acetylation of a protein
N-terminal is indicated by _. |
| pep_res_after | Flanking residue on C-term side of peptide | |
| pep_start | Ordinal position of first peptide residue in protein sequence | |
| pep_end | Ordinal position of last peptide residue in protein sequence | |
| pep_len | Number of amino acid residues in a peptide sequence | |
| pep_miss | Count of missed cleavage sites in peptide | |
| pep_istryptic | Is a peptide sequence a canonical tryptic peptide or not | |
| pep_semitryptic | Charateristics of semi-tryptic peptides | |
| pep_frame | Mascot calculation of translation frame number | |
| pep_var_mod | Mascot variable modifications from all sources as list of names | |
| pep_var_mod_pos | Mascot variable modifications as a string of digits, e.g. 0.0001000.0?. Non-zero digits identify mods according to key in export header. First and last positions are for terminus mods. | |
| pep_summed_mod_pos | Mascot: when two variable modifications occur at the same site, a string of digits defining the second mod | |
| pep_local_mod_pos | Mascot query-level variable modifications as a string of digits. The names of the mods will be listed in pep_var_mod | |
| pep_num_match | Mascot count of fragment ion matches in ion series used to calculate the score | |
| pep_scan_title | Scan title taken from peak list | |
| pep_scan_range | Mascot scan number range of a peptide match | proteoM: pep_scan_num |
| pep_ret_range | Range of LCMS retention times of a peptide match | |
| pep_ret_sd | Standard deviation of pep_ret_range |
The levels of peptide uniqueness are according to
group_psm_by. |
| pep_ms2_sumint | Mascot sum of MS2 fragment-ion intensity of a peptide match | |
| pep_n_ions | Mascot count of matched and unmatched fragment ions of a peptide query | |
| pep_locprob | The highest probablity from site analysis of the variable modification sites | The second highest probablity,
pep_locprob2 is made implicit through
pep_locdiff. Cf. pep_var_mod_conf from
Mascot. |
| pep_locdiff | pep_locprob - pep_locprob2 | |
| pep_phospho_locprob | The highest probability in phosphorylation sites | |
| pep_phospho_locdiff | The difference between the highest and the second highest probability of phosphorylation sites | |
| pep_ions_first | Mascot: the first series of ions being matched under a PSM query | Need enabled MS/MS peak lists during PSM
exports |
| pep_ions_second | Mascot: the second series of ions being matched under a PSM query | v.s. |
| pep_ions_third | Mascot: the third series of ions being matched under a PSM query | v.s. |
| pep_n_ms2 | Number of entries in a peak list | |
| pep_scan_num | MS scan number of peptide | |
| pep_mod_group | Index of peptide modification group | |
| pep_fmod | Fix modifications of peptide | |
| pep_vmod | Variable modificaiton of peptide | |
| pep_isdecoy | Is peptide identication a decoy or not | |
| pep_ivmod | Variable Anywhere modificaiton of peptide as a string of hexcodes | Terminal modifications are not part of the string. proteoM: ordianl number in square brackets indicating neutral losses for fixed modifications and in round parentheses for variable modifications; the numbers are removed during proteoQ. |
| pep_issig | Is peptide identication significant or not | |
| pep_score_co | Peptide enrichment score cut-off | |
| pep_rank_nl | Ordinal number of neutral loss under the same peptide
including modifications (pep_seq_mod) |
|
| pep_ms2_moverzs | Experimental MS2 m/z values | |
| pep_ms2_ints | Experimental MS2 intensity values | |
| pep_ms2_theos | Theoretical MS2 m/z values (primary ions) | |
| pep_ms2_theos2 | Theoretical MS2 m/z values (secondary ions) | |
| pep_ms2_deltas | Mass deltas between experimentals and theoreticals (primary) | |
| pep_ms2_ideltas | The indexes of pep_ms2_deltas along the sequence of theoretical MS2 ions (primary) | |
| pep_ms2_deltas2 | Mass deltas between experimentals and theoreticals (secondary) | |
| pep_ms2_ideltas2 | The indexes of pep_ms2_deltas along the sequence of theoretical MS2 ions (secondary) | |
| pep_ms2_deltas_mean | The mean of pep_ms2_deltas | |
| pep_ms2_deltas_sd | The standard deviation of pep_ms2_deltas | |
| gene | Protein gene name | |
| fasta_name | Protein name taken from Fasta title line | By default, the character string before the first white
space. See also ?read_fasta for additional options. |
| uniprot_acc | Protein UniProt accession | |
| uniprot_id | Protein UniProt entry name | |
| refseq_acc | Protein RefSeq accession | |
| other_acc | Protein accession with formats other than UniProt or RefSeq | |
| entrez | Protein Entrez ID | |
| species | Species of a protein entry | |
| acc_type | Type of accession names | One of refseq_acc,
uniprot_acc, uniprot_id or
other_acc. |
| shared_prot_accs | List of protein accesssions under the same group | MSFragger at UniProt fasta(s): fasta_name(s) being used for non-primary protein(s). |
| shared_genes | List of genes under the same group | |
| kin_attr | Is protein with an attribute of kinase or not | |
| kin_class | Class of kinases, e.g., TK, TKL… | |
| kin_order | Order of “kin_class” from the kinase tree diagram | |
| dat_file | File name of PSM results | |
| … | Additional column keys | |
| raw_file | MS file name where peptides or proteins are identified | |
| I126 etc. | Reporter-ion intensity from MS/MS ion search; I000 in LFQ | |
| N_I126 etc. | Normalized I126 etc.; N_I000 in LFQ | The calibration factors for the alignment of
log2R... are used to scale the reporter-ion intensity
across samples. |
| R126 etc. | Linear FC relative to TMT-126; R000 in LFQ | |
| sd_log2_R126 etc. | Standard deviation of peptide log2FC; sd_log2_R000 in LFQ | Calculated from contributing PSMs under each TMT channel. |
| log2_R126 etc. | log2FC in relative to the average intensity of reference(s) under each multiplex TMT; log2_R000 in LFQ | Relative to the row-mean intensity within each multiplex TMT if no reference(s) are present. |
| N_log2_R126 etc. | Median-centered log2_R…; N_log2_R000 in LFQ |
Prior to significance tests, the primary peptide outputs with and
without the imputation of NA values are summarized in
Peptide.txt and Peptide_impNA.txt,
respectively. The column keys therein are described in the
following:
| Header | Descrption | Note |
|---|---|---|
| prot_acc | Protein accession string | |
| prot_desc | Protein description taken from Fasta title line | |
| prot_mass | Protein mass | |
| prot_len | Number of amino acid residues under a proposed protein | |
| prot_pi | Mascot calculation of protein isoelectric point | |
| prot_tax_str | Mascot protein taxonomy as string | |
| prot_tax_id | Mascot protein taxonomy as Tax ID | |
| prot_seq | Mascot protein sequence in 1 letter code | |
| prot_empai | Mascot empirical protein abundance index | |
| prot_icover | Protein sequence coverage by counts of tryptic peptides | See also PSM keys. |
| prot_cover | Coverage of protein sequence by amino acid residues | v.s. |
| prot_issig | Protein identification is significant or not | |
| prot_isess | Protein identification is essential or not | |
| prot_tier | Protein tier | |
| prot_es | Protein enrichment score | |
| prot_es_co | Protein enrichment score cut-off | |
| prot_n_psm | Count of significant PSMs in quantitation under a proposed protein | Joint results from individual PSM tables; the counts exclude entries that are filtered by users. |
| prot_n_pep | Count of significant peptide sequences in quantitation under a proposed protein | v.s. |
| pep_n_psm | Counts of significant PSMs in quantitation under a proposed peptide | v.s. |
| pep_isunique | Peptide sequence is unique or not. | See also PSM keys for levels of uniqueness |
| pep_literal_unique | Peptide sequence is unique to hit or not. | v.s. |
| pep_razor_unique | Peptide sequence is unique to group or not. | v.s. |
| pep_tot_int | Total MS1 intenisty of a peptide sequence | Sum statistics from PSMs |
| pep_unique_int | Unique MS1 intenisty of a peptide sequence | v.s. |
| pep_razor_int | Razor MS1 intenisty of a peptide sequence | v.s. |
| pep_res_before | Flanking residue on N-term side of peptide | |
| pep_seq | One-letter representation of peptide sequences without variable modifications | |
| pep_seq_mod | pep_seq with variable modifications |
|
| pep_res_after | Flanking residue on C-term side of peptide | |
| pep_start | Ordinal position of first peptide residue in protein sequence | |
| pep_end | Mascot: ordinal position of last peptide residue in protein sequence | |
| pep_len | Number of amino acid residues in a peptide sequence | |
| pep_miss | Count of missed cleavage sites in peptide | |
| pep_istryptic | Is a peptide sequence a canonical tryptic peptide or not | |
| pep_semitryptic | Charateristics of semi-tryptic peptides | |
| pep_frame | Mascot calculation of translation frame number | |
| pep_isdecoy | Is peptide identication a decoy or not | |
| pep_issig | Is peptide identication significant or not | |
| pep_score_co | Peptide enrichment score cut-off | Median description from PSMs. |
| pep_score | Score of peptide | v.s. |
| pep_expect | Expectation value or posterior error probability of PSM(s) | Geometric-mean description from PSMs. |
| pep_n_exp_z | Number of unique charge states under a peptide sequence | Median description from PSMs. The levels of peptide
uniqueness are according to group_psm_by. |
| pep_ret_range | Range of LCMS retention times of a peptide match | v.s. |
| pep_ret_sd | Standard deviation of pep_ret_range across
samples and LCMS series |
|
| pep_locprob | The highest probablity from site analysis of the variable modification sites | Median description from PSMs. |
| pep_locdiff | pep_locprob - pep_locprob2 | See also PSM keys. |
| pep_phospho_locprob | The highest probability in phosphorylation sites | |
| pep_phospho_locdiff | The difference between the highest and the second highest probability of phosphorylation sites | |
| pep_mod_protnt | Is if a sequence contains protein N-terminal modification | Optional at
normPSM(group_psm_by = pep_seq_mod, use_lowercase_aa = TRUE).
See also. ?proteoQ::normPSM. |
| pep_mod_protntac | Is a sequence contains protein N-terminal acetylation | v.s. |
| pep_mod_pepnt | Is a sequence contains N-terminal modification | v.s. |
| pep_mod_m | Is a sequence contains methionine oxidation | v.s. |
| pep_mod_n | Is a sequence contains asparagine deamidation | v.s. |
| pep_mod_sty | Is a sequence contains the phospholyration of serine, threonine or tyrosine | v.s. |
| pep_mod_pepct | Is a sequence contains C-terminal modification | v.s. |
| pep_mod_protctam | Is a sequence contains protein C-terminal amidation | v.s. |
| pep_mod_protct | Is a sequence contains protein C-terminal modification | v.s. |
| pep_mean_raw | Mean log2_R (…) across samples | Reference and Empty samples
excluded. |
| pep_mean_n | Mean N_log2FC(…) across samples | v.s. |
| pep_mean_z | Mean Z_log2FC(…) across samples | v.s. |
| gene | Protein gene name | |
| fasta_name | Protein name taken from Fasta title line | See also PSM keys. |
| uniprot_acc | Protein UniProt accession | |
| uniprot_id | Protein UniProt entry name | |
| refseq_acc | Protein RefSeq accession | |
| other_acc | Protein accession with formats other than UniProt or RefSeq | |
| entrez | Protein Entrez ID | |
| species | Species of a protein entry | |
| acc_type | Type of accession names | |
| shared_prot_accs | List of protein accesssions under the same group | See also PSM keys. |
| shared_genes | List of genes under the same group | |
| kin_attr | Is protein with an attribute of kinase or not | |
| kin_class | Class of kinases, e.g., TK, TKL… | |
| kin_order | Order of “kin_class” from the kinase tree diagram | |
| mean_lint | Mean log10 intensity (N_I…) across samples | Reference and Empty samples
excluded. |
| count_nna | Count of non-NA log2FC | v.s. |
| … | Additional column keys | |
| I… (…) | Reporter-ion intensity; I000 in LFQ | Calculated from the descriptive statistics by
method_psm_pep in PSM2Pep() for indicated
samples. |
| N_I… (…) | Normalized I… (…); N_I000 in LFQ | The calibration factors for the alignment of log2FC are used to scale the reporter-ion intensity across samples. |
| sd_log2_R… (…) | Standard deviation of protein log2FC; sd_log2_R000 in LFQ | Calculated from contributing peptides under each sample. |
| log2_R… (…) | log2FC relative to reference materials for indicated samples; log2_R000 in LFQ | Before normalization. |
| N_log2_R… (…) | Aligned log2_R… (…) according to method_align in standPep() without scaling normalization; N_log2_R000 in LFQ | |
| Z_log2_R… (…) | N_log2_R… (…) with scaling normalization; Z_log2_R000 for LFQ |
Following optional significance tests, the primary peptide outputs
with and without the imputation of NA values are summarized in
Peptide_pVals.txt and Peptide_impNA_pVals.txt,
respectively. Additional column keys include p-values, log2FC and linear
FC, with the incorporation of contrast and formula names that have been
specified in calls to pepSig.
Prior to significance tests, the primary protein outputs with and
without the imputation of NA values are summarized in
Protein.txt and Protein_impNA.txt,
respectively. The corresponding column keys are described in the
following:
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 40 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): Header, Descrption, Note
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Header | Descrption | Note |
|---|---|---|
| prot_acc | Protein accession string | NA |
| prot_desc | Protein description taken from Fasta title line | NA |
| prot_mass | Protein mass | NA |
| prot_len | Number of amino acid residues under a proposed protein | NA |
| prot_pi | Mascot calculation of protein isoelectric point | NA |
| prot_tax_str | Mascot protein taxonomy as string | NA |
| prot_tax_id | Mascot protein taxonomy as Tax ID | NA |
| prot_seq | Mascot protein sequence in 1 letter code | NA |
| prot_empai | Mascot empirical protein abundance index | NA |
| prot_icover | Protein sequence coverage by counts of tryptic peptides | NA |
| prot_cover | Protein sequence coverage | NA |
| prot_n_psm | Count of significant PSMs in quantitation under a proposed protein | NA |
| prot_n_uniqpsm | Count of unique, significant PSMs in quantitation under a proposed protein | NA |
| prot_n_pep | Count of significant peptide sequences in quantitation under a proposed protein | NA |
| prot_n_uniqpep | Count of unique, significant peptide sequences in quantitation under a proposed protein | NA |
| prot_tot_int | Total MS1 intenisty of peptides under a proposed protein | Sum statistics from peptide intensities |
| prot_unique_int | Unique MS1 intenisty of peptides under a proposed protein | v.s. |
| prot_razor_int | Razor MS1 intenisty of peptides under a proposed protein | v.s. |
| prot_mean_raw | Mean log2_R (…) across samples | NA |
| prot_mean_n | Mean N_log2FC(…) across samples | NA |
| prot_mean_z | Mean Z_log2FC(…) across samples | NA |
| gene | Protein gene name | NA |
| fasta_name | Protein name taken from Fasta title line | NA |
| uniprot_acc | Protein UniProt accession | NA |
| uniprot_id | Protein UniProt entry name | NA |
| refseq_acc | Protein RefSeq accession | NA |
| other_acc | Protein accession with formats other than UniProt or RefSeq | NA |
| entrez | Protein Entrez ID | NA |
| species | Species of a protein entry | NA |
| acc_type | Type of accession names | NA |
| kin_attr | Is protein with an attribute of kinase or not | NA |
| kin_class | Class of kinases, e.g., TK, TKL… | NA |
| kin_order | Order of kin_class from the kinase tree diagram | NA |
| mean_lint | Mean log10 intensity (N_I…) across samples | NA |
| count_nna | Count of non-NA log2FC | NA |
| I… (…) | Reporter-ion intensity; I000 in LFQ | According to the descriptive statistics in Pep2Prn(method_pep_prn = …) |
| N_I… (…) | Normalized I… (…); N_I000 in LFQ | NA |
| log2_R… (…) | log2FC relative to reference materials for indicated samples; log2_R000 in LFQ | NA |
| N_log2_R… (…) | Aligned log2_R… (…); N_log2_R000 in LFQ | According to standPrn(method_align = …) without scaling normalization |
| Z_log2_R… (…) | N_log2_R… (…) with scaling normalization; Z_log2_R000 in LFQ | NA |
Analogous to the peptide tables, the primary protein outputs with and
without the imputation of NA values are summarized in
Protein_pVals.txt and Protein_impNA_pVals.txt,
respectively.
See notes for exporting PSMs from Mascot.
See notes here.
To cite this work: (2019) R package proteoQ for Quantitative Proteomics Using Tandem Mass Tags or Label-free Approaches. https://github.com/qzhang503/proteoQ.↩︎
If not, try `devtools::install_github(“qzhang503/proteoQDA”)`.↩︎
To extract the names of RAW MS files under a `raw_dir` folder: `extract_raws(raw_dir)`. Very occasionally, there may be RAW MS files without PSM contributions. In this case, the file names will be shown as missing by the program and need to be removed from `expt_smry.xlsx` or `frac_smry.xlsx`. The function `extract_psm_raws()` was developed to extract the list of RAW files that are actually present in PSM files.↩︎
On top of technical variabilities, the ranges of CV may be further subject to the choice of reference materials. Examples are available in Lab 3.1.↩︎
`standPep()` will report log2FC results both before and after the scaling of standard deviations.↩︎
The default is `scale_log2r = TRUE` throughout the package. When calling functions involved parameter `scale_log2r`, we may specify explicitly `scale_log2r = FALSE` if needed, or more preferably define its value under the global environment.↩︎
This will work as GO terms of human start with `hs_` and KEGG terms with `hsa`.↩︎
Details on the notation of peptide modifications can be found via `?normPSM`.↩︎